Searching Musical Audio Datasets by A Batch of Multi-Variant Tracks
Searching Musical Audio Datasets by A Batch of Multi-Variant Tracks
Searching Musical Audio Datasets by A Batch of Multi-Variant Tracks
Multi-Variant Tracks
ABSTRACT Keywords
Multi-variant music tracks are those audio tracks of a par- Content-based audio retrieval, cover songs, musical audio
ticular song which are sung and recorded by different peo- sequences summarization, hash-based indexing
ple (i.e., cover songs). As music social clubs grow on the
Internet, more and more people like to upload music record- 1. INTRODUCTION
ings onto such music social sites to share their own home-
produced albums and participate in Internet singing con- The World Wide Web (WWW) has become much more
tests. Therefore it is very important to explore a computer- lively musically speaking since we can upload our own audio
assisted evaluation tool to detect these audio-based multi- or video records onto the Internet and share them with the
variant tracks. In this paper we investigate such a task: world on such popular web sites as http://www.midomi.com,
the original track of a song is embedded in datasets, with http://www.yyfc.com and http://www.last.fm. Often, these
a batch of multi-variant audio tracks of this song as input, music social clubs provide an online forum for Internet singing
our retrieval system returns an ordered list by similarity contests to attract people to vote for their favorite songs. Be-
and indicates the position of relevant audio track. To help sides people’s subjective comments, the computer-assisted
process multi-variant audio tracks, we suggest a semantic in- locating of a song version is also of great importance to
dexing framework and propose the Federated Features (FF) us. When a particular song is sung and recorded by dif-
scheme to generate the semantic summarization of audio ferent people, “cover songs” or “multi-variant audio tracks”
feature sequences. The conjunction of federated features for the song are produced. As a branch of query-by-content
with three typical similarity searching schemes, K-Nearest audio search and retrieval, locating and evaluating multi-
Neighbor (KNN), Locality Sensitive Hashing (LSH), and variant audio tracks is an interesting task [1]. Some inter-
Exact Euclidian LSH (E2 LSH), is evaluated. From these esting examples of multi-variant audio tracks can be found at
findings, a computer-assisted evaluation tool for searching http://www.e.ics.nara-wu.ac.jp/∼yuyi/AudioExamples.htm.
multi-variant audio tracks was developed to search over large Most of recent work [2, 3, 4] related to multi-variant audio
musical audio datasets. track detection focus on musical audio sequences compari-
son to achieve a perfect matching. However, it takes much
time to match two audio sequences since the feature sets
Categories and Subject Descriptors (Chroma [3], Pitch [5], MFCC [6]) used for audio data have
H.3.3 [Information Systems]: Information Search and Re- very high dimensionality. In this paper we review recent
trieval; H.5.5 [Information Systems]: Sound and Music research works on the popular query-by-content audio re-
Computing; J.5 [Arts Humanities]: Music trieval techniques in Figure 1. It consists of two main parts:
1) audio feature selection; and, 2) audio sequence matching
General Terms approaches. As shown in Figure 1 Dynamic Programming
(DP) [3, 5] is a very important matching method, that per-
Algorithms, Performance, Experimentation
forms exhaustive audio sequence comparisons which obtain
good exactness. However, DP lacks scalability and results
in slower retrieval speeds as databases get larger.
Permission to make digital or hard copies of all or part of this work for We do not adopt DP in our scheme. Instead a novel
personal or classroom use is granted without fee provided that copies are weighting scheme–Federated Features (FF)–is proposed to
not made or distributed for profit or commercial advantage and that copies generate the semantic summarization of audio sequence. We
bear this notice and the full citation on the first page. To copy otherwise, to also use musicminer [7] to generate more potentially useful
republish, to post on servers or to redistribute to lists, requires prior specific audio features as candidates for the FF. According to the im-
permission and/or a fee.
MIR’08, October 30–31, 2008, Vancouver, British Columbia, Canada. portance of features used in audio sequence comparisons, we
Copyright 2008 ACM 978-1-60558-312-9/08/10 ...$5.00. are interested in the combination between frequently-used
Audio Features Similarity Matching application of LSH in large-scale music retrieval is evalu-
ated. Shingles are created by concatenating consecutive
Fourier transform [8] Locality Sensitive Hash [8]
frames and used as high-dimensional features. E2 LSH is
MFCC [6] Locality Sensitive Hash [6] then adopted to compare a query with references.
Timbre, Rhythm and Pitch [9] K-Nearest Neighbor [9]
MFCC [10] K-Nearest Neighbor [10] 2.2 LSH and E2 LSH
LSH [11] is an index-based data organization structure
STFT [4] Locality Sensitive Hash [4]
proposed to find all similar pairs of a query point in the Eu-
Chroma [3] Dynamic Programming [3] clidean space. It has been used in such applications as the
Pitch [5] Dynamic Programming [5] well-known solution to determine whether any pair of docu-
ments are similar or not (image [12], audio [6, 8], video [13],
etc.). Features are extracted from documents and projected
Figure 1: Existing audio retrieval techniques. with a group of hash functions to hash values (index point).
Feature vectors are regarded as similar to one another if they
are projected to the same hash value.
and musicminer-generated audio features. Through a large If two features (Vq , Vi ) are very similar they will have a
audio corpus we evaluate the proposed audio feature training small distance Vq − Vi , hash to the same value and fall
scheme and show the weighted audio feature summarization into the same bucket with a high probability. If they are
can effectively represent melody-based lower-level music in- quite different they will collide with a small probability. A
formation. function family H = {h : S → U}, each h mapping one point
The conjunction of the proposed Federated Feature (FF) from domain S to U , is called locality sensitive, if for any
with three typical searching schemes, K-Nearest Neighbor features Vq and Vi , the probability
(KNN), Locality Sensitive Hash (LSH), and Exact Euclid-
ian LSH (E2 LSH) [11], is evaluated. We also extend and P rob(t) = PrH [h(Vq ) = h(Vi ) : Vq − Vi = t] (1)
summarize our work as an evaluation tool to help search
is a strictly decreasing function of t. That is, the collision
musical datasets with a batch of multi-variant audio tracks.
probability of features Vq and Vi is diminishing as their dis-
tance increases. The family H is further called (R, cR, p1 , p2 )
2. BACKGROUND (c > 1, p2 < p1 ) sensitive if for any Vq , Vi ∈ S
Audio feature descriptors of musical audio sequences have if ||Vq − Vi || < R, PrH [h(Vq ) = h(Vi )] ≥ p1
very high dimensionality, which makes it increasingly diffi-
cult to quickly detect audio documents that closely resem- if ||Vq − Vi || > cR, PrH [h(Vq ) = h(Vi )] ≤ p2 (2)
ble a given input query as the number of audio tracks in A good family of hash functions will try to amplify the gap
the database increase. Solving this hard problem is mainly between p1 and p2 .
related to two points: 1) refining music representation to im- In E2 LSH the locality sensitive dimension reduction can
prove the accuracy of musical semantic similarity (pitch [5], be applied on a vector X whose each dimension follows the
Mel-Frequency Cepstral Coefficient (MFCC) [10], Chroma same P -stable distribution D. fȳk (X), the inner product
[2, 3]); and, 2) organizing music documents in the way that between X and a real vector ȳk , linearly combines all di-
helps speed up music similarity searching (LSH [4, 6, 8], mensions of X and generates a single output. With the
E2 LSH [4, 16], tree structure [9, 15]). matrix Y = (ȳ1 , ȳ2 , ..., ȳm ) an m-dimension vector fY (X) =
(fȳ1 (X), fȳ2 (X), ..., fȳm (X))T can be obtained, each dimen-
2.1 Related Work sion of which also follows the distribution D. When each
Chroma and DP are applied in [2, 3] to perform cover song dimension of Vq and Vi follows a P -stable distribution, each
detection. They guarantee the retrieval accuracy, however, dimension of fY (Vq ) and fY (Vi ) also follows the same dis-
at the cost of long sequence comparison time. In [4] LSH tribution. Then Vq and Vi can be replaced by fY (Vq ) and
and E2 LSH are used to accelerate sequence comparisons in fY (Vi ) respectively in Eq.(1-2).
query-by-content music retrieval. Yang used random sub-
sets of spectral features derived from a Short Time Fourier 2.3 This Work
Transform (STFT) to calculate hash values for the paral- This present work mainly aims to retrieve cover songs with
lel LSH hash instances in [8]. With a query as input, the a batch of multi-variant audio tracks. For this purpose, we
relevant features are matched from hash tables. To resolve introduce a semantic indexing framework for content-based
bucket conflicts, a Hough transformation is performed on music information and propose a novel semantic feature re-
these matching pairs to detect the similarity between the gression model based on fundamental similarity rules. Two
query and each reference song by the linearity filtering. groups of audio feature sets (frequently-used features in Fig-
In [9] a composite feature tree (semantic features, such as ure 1 and musicminer-generated features [7]) are respectively
timbre, rhythm, pitch, e.g.) was proposed to facilitate KNN trained via the proposed regression model. Our new Feder-
search. Principle Component Analysis (PCA) was used to ated Features (FF) set results from the combination of the
transform the extracted feature sequence into a new space trained set of frequently-used features and the trained set of
sorted by the importance of acoustic features. In [15] the ex- musicminer-generated features. A group of more meaningful
tracted features (Discrete Fourier Transform) are grouped weights are assigned to FF. In this way the musical audio
by Minimum Bounding Rectangles (MBR) and compared sequence can be summarized as the especially interesting se-
with an R*-tree. Though the number of features can be re- mantic audio representation. Searching schemes, KNN, LSH
duced, sometimes the summarized (grouped) features may and E2 LSH are implemented to evaluate FF. Moreover, we
not sufficiently discriminate two different signals. In [16] develop an audio detection system to batch multi-variant
audio queries. The experimental results show that our FF
is also very helpful for the large datasets. The set of
frequently-used
features Semantic
Audio input regression Federated
3. APPROACHES model Features
In this section we discuss the similarity searching prob- The set of (FF)
musicminer-
lems associated with the musical audio content and give the generated features
h1 (.)
definition of musical audio datasets searching by a batch hn (.)
of multi-variant tracks. To solve this retrieval problem, we
propose a new semantic-based audio similarity principle and Hash table 1 Hash table n
Bucket 1 Bucket 1
explain how to train a group of meaningful weights for our Bucket 2 Bucket 2
FF set and how to map the semantic audio representation Bucket 3 Bucket 3
to the indexable hash values. ……. …….
……. …….
……. …….
3.1 Problem Description ……. …….
Content-based music similarity searching strives to cap-
ture relevant music content/semantic-description (which may
be a song [8, 9], or an established category related to genre
[7], emotion [17], etc.) in the database with a query exam- Figure 2: Federated features based indexing frame-
ple, where the query example (from some defined description work.
of music) is similar [8, 9] or categorized [7, 17] to the refer-
ence/provided content/attribute according to some similar-
ity criteria. Content-based music similarity can be defined
3.2 Semantic Indexing Framework
over a wide continuum since it is related to search intention, The indexing framework for semantic musical audio rep-
musical culture, personal opinion and emotion, etc. This resentation includes two major parts: 1) effectively gener-
work takes into account a similarity retrieval mechanism as ating semantic feature summarization; and, 2) accurately
follows: 1) taking a batch of multi-variant tracks as audio calculating hash values. Figure 2 demonstrates a seman-
inputs (which are originally from the same popular song but tic indexing framework based on federated features. Audio
recorded by different people); 2) performing LSH/E2 LSH feature representation is a very important component for de-
and KNN searches; 3) returning the ordered audio tracks signing the content-based audio retrieval engine, especially
list; and, 4) evaluating the retrieval results by relevance. when there are multi-variant audio queries. To achieve a
For each group/batch of multi-variant audio tracks, there better semantic audio representation, we select the set of
is only one relevant audio track in the test datasets. As frequently-used audio features introduced in Section 3.3 and
introduced in Section 1, this is an essential component for the set of musicminer-generated audio features [7] as candi-
browsing, searching and evaluating the musical audio-based dates to train a group of weights by our semantic regression
content information on the Internet (or personal digital me- model. For each audio track the summarized semantic FF is
dia player). calculated and mapped to hash values. The FF features of
To maintain notional consistency throughout the paper, audio tracks in the datasets are organized by LSH/E2 LSH
the general description of key symbols in this work is given in hash tables. The FF of an audio query can be calculated
here. Given a collection of songs R = {ri,j : ri,j ∈ Ri , 1 ≤ and mapped to one fixed hash value (or directly perform
i ≤ |R|, 1 ≤ j ≤ |Ri |}, (ri,j is the j th spectral feature of exhaustive KNN searching). If hash values of two FF col-
the ith song Ri ), the feature sequence ri,j of the ith song Ri lide into the same bucket, it would mean that they are very
is summarized to Vi , an n-dimension feature vector in the similar with a high probability.
Euclidean space (Vi ∈ n ). The summarized feature Vi in-
stead of the feature sequence ri,j can then be utilized in the 3.3 Federated Features
retrieval stage. To further accelerate the retrieval speed, Audio documents can be described as time-varying feature
hash-based indexing is also adopted. Each hash function sequences. Directly computing the distance between two
hk (·) maps Vi to a single value. Then the group of N in- audio documents (matching audio documents) is one impor-
dependent hash functions h1 (·), h2 (·), ..., hN (·) generate a tant task in implementing query-by-content music retrieval.
vector of hash values H(Vi ) = [h1 (Vi ), h2 (Vi ), ..., hN (Vi )]T . DP can be used in matching two audio feature sequences
Inside a hash instance each hash vector is assigned to a and is an essentially exhaustive search approach (which of-
bucket and two summarized features with the same hash fers high accuracy). However it lacks scalability and results
vector fall into the same bucket. L parallel hash instances in slower retrieval speeds as databases grow larger.
are constructed to support a high recall. Given a query song To quicken the audio feature sequences comparison and
Q (with the summarized feature Vq ) and a similarity func- obtain the scalable retrieval, semantic features are extracted
tion ρ(·, ·), we would like to compute the similarity degree from the audio structure or based on the extracted semantic
ρ(Vq , Vi ) between the query Q and each of the songs Ri in features the weight for each feature is determined by regres-
the database. They are similar if ρ(Vq , Vi ) is above the pre- sion method [9]. Unfortunately, the semantic feature extrac-
defined similarity threshold β. The similarity between two tion (high-level) used in [9] is originally and mainly proposed
feature vectors can be computed by several similarity mea- to summarize the audio feature sequence for musical genre
sures, Euclidean distance, cosine measure, etc. In this paper classification [18]. These semantic feature summarizations
we feed our semantic musicalaudio features into Euclidean cannot effectively represent melody-based lower-level music
space, d(Vq , Vi ) = Vq − Vi 2 (Vq 2 · Vi 2 ). information. We can find the most popular query-by-content
Short time Long time 3.3.2 Regression model
Feature function
dimension dimension A single feature can not well summarize a song. Multiple
MFCC 13 Mean(), Std(), 13*8 features must be combined to represent a song. These fea-
Mel-Magnit 40 Skew(), Kurt(), tures play different roles in the query and must be weighed
40*8
udes Mean(|ǻ|), Std(|ǻ|), by different weights. We introduce a scheme based on mul-
Chroma 12 Skew(|ǻ|), Kurb(|ǻ|) 12*8 tivariable regression to determine the weight for each fea-
Pitch 1 histogram 88 ture. The goal of our approach is to apply linear and non-
Total 25 groups 608 parametric regression models to investigate the correlation.
In the model we use K (K=25 according to Figure 3)
groups of features. Let the groups of features of ith song be
Figure 3: Dimension of features.
vi1 , vi2 , ..., viK . With different weight αk assigned to each
feature group, the total summary vector is
audio retrieval techniques in Figure 1 in terms of audio se- Vi = [α1 vi1 , α2 vi2 , ..., αK viK ]T (4)
quence representations. Different features represent differ-
ent aspects of audio signals and were proposed for different The Euclidean distance between two features Vi and Vj is
purposes. For example, Pitch [5] is extracted from the sung
melody to generate a note sequence while semitone-based
K
K
K
d(Vi , Vj ) = d( αk vik , αk vjk ) = α2k d(vik , vjk ) (5)
Chroma [3] is used to tolerate differences in instrumentation
k=1 k=1 k=1
and general musical style.
In [7] the large scale generation of long-term audio fea- To determine the weight of the above equation, we apply
tures is used to obtain concise and interpretable features multivariable regression process. For training purpose, we
summarizing a complete song and indicate the probability select from the database M pairs of songs, which contain
of this song belonging to a certain group. Frequently-used both similar pairs and non-similar pairs. By these pairs
features in Figure 1 and musicminer-generated features [7] we obtain the sequences of Chroma similar to [3], and then
encourage us to combine them by using multivariable regres- calculate M pairs of sequence similarity D(Ri , Rj ).
sion to train a group of features that are summarized as a We will choose the weight in Eq.(4) so that the similarity
simple descriptor to represent the characteristics of one au- calculated by the summary is as near to the sequence sim-
dio sequence as comprehensively and effectively as possible. ilarity as possible, i.e., we hope the melody information is
The dimensionality of final musical audio features is kept contained in the summary. After we determine the similarity
low through the summarization created by our proposed re- between the pairs of training data, we get a M *K matrix DV
gression model. The goal is to avoid the heavy computation and the M -dimensional column vector DDP . The ith row of
of feature sequence comparisons and make query-by-content DV has K distance values calculated from independent fea-
music retrieval for large databases possible. tures d(vik , vjk ), k = 1, 2, ..., K and the ith element of DDP
The feature sets used in training the regression model are is the normalized distance between the two feature sequences
given in Figure 3. They are used in audio content retrieval. D(Ri , Rj ) · K/(|Ri | · |Rj |). Let A = [α21 , α22 , · · · , α2K ]T . Then
For a description of the feature extraction methods the in- A = (DVT DV )−1 DVT DDP and we obtain the weight α. We
terested reader is referred to [3, 5, 7, 10, 19, 20]. are only interested in the absolute value of α. The feature
set defined in Eq.(4) is called the Federated Features (FF)
3.3.1 Similarity-Invariance of Summarization set.
Two questions occur in the summarizing stage: 1) how to
summarize the high dimensional features? and, 2) how to 3.4 KNN and LSH/E2 LSH Searching
guarantee that a summarized feature descriptor reflects the After training a group of weights for federated features, we
melody information? As for the first question, there are sev- obtain the combined independent high dimensional feature
eral summarizing methods such as calculating the mean and set for each audio sequence. It can be used together with
standard deviation of all the features, PCA, etc. As for the KNN and LSH/E2 LSH.
second question the summarizing procedure should exhibit Here we first introduce the indexing structure of datasets
substantive characteristics of similarity-invariance, i.e., simi- with LSH/E2 LSH. According to Figure 2, the federated fea-
lar melodies lead to similar summaries, non-similar melodies tures of the songs in the datasets are stored in the hash tables
lead to non-similar summaries. To solve the above issues a in terms of their hash values. When LSH is used, the hash
basic melody-based summarization principle is considered as values are directly calculated from the FF and each hash
follows. ri,j is the j th spectral feature of the ith reference instance has its own hash functions. On the other hand,
song. The sequence similarity between ith song and kth song when E2 LSH is used, the FF is first projected to a lower di-
is ϕik ({rij }, {rkj }). The ith audio feature sequence {ri,j } is mension vector in each hash instance. Then hash values are
summarized to Vi . The similarity between the ith and kth calculated. In Figure 2 the total effect of the hash function
feature summary is ψik (Vi , Vk ). A similarity threshold θ or is represented as Hk (.) for the kth hash instance.
θ would state how close between any two feature sequences Consider a query Vq and its relevant song Vi in the dataset.
or summarized features. With a good summarization we With KNN, Vq is compared against each of the song in the
could expect that dataset and the one with least distance is regarded as the
relevant song. When LSH/E2 LSH is used, instead of per-
ϕik ({rij }, {rkj }) > θ ⇔ ψik (Vi , Vk ) > θ (3) forming an exhaustive search, from the query Vq , its hash
In this sense the summarization is similarity-invariant. value Hk (Vq ) is calculated for the kth hash instance. The
features located in the bucket of Hk (Vq ) are then found and
Vq is only compared against these features. With locality
sensitive mapping, Vq and its relevant song Vi have the same 1
hash values Hk (Vq ) = Hk (Vi ) in at least one of the hash in-
stances with a high probability. Therefore Vi will probably 0.8
be located in the buckets indexed by Hk (Vq ) and be found. MRR
MRR(1)
0.6
4. EXPERIMENTS
0.4
In this section we introduce our experimental setup, show
some evaluation results over a large audio datasets, and
0.2
demonstrate a novel content-based music information de-
tection system with multi-variant audio (i.e., cover song)
0
queries. mean mean mean hist FF
MFCC Mel Chroma Pitch Regress
4.1 Experimental Setup
Our music collection includes 4121 songs that fall into four
non-overlapping datasets. Trains80 is collected from our Figure 4: MRR achieved by different features.
personal collections and www.yyfc.com (a non-commercial
amusement website where users can sing her/his favorite 1
songs, make records online, and share them with friends
or the yyfc community). It consists of 40 popular songs 0.8
each represented in two versions and 80 single-version songs.
These 160 tracks are used to train the weights of regres-
MRR(1)
0.6
sion model proposed in section 3.3.2. Covers79 is also col-
lected from www.yyfc.com and consists of 79 popular Chi- 0.4
nese songs each represented in different versions (sung by KNN
different people with possibly similar background music). LSH
0.2
Each song on average has 13.5 versions, resulting in a total of E2LSH
1072 audio tracks. The corpora RADIO (1431) and ISMIR
0
(1458) are used as noise background datasets and respec-
2 4 6 8 10 12 14
tively collected from the public websites www.shoutcast.com
and http://ismir2004.ismir.net/genre contest/index.htm. Number of hash instances
Each song is 30s long in mono-channel wave format, 16bit
per sample and the sampling rate is 22.05KHz. The audio Figure 5: Mean reciprocal rank at different number
data is normalized and then divided into overlapped frames. of hash instances.
Each frame contains 1024 samples and the adjacent frames
have 50% overlap. Each frame is weighed by a hamming win-
dow and further appended with 1024 zeros to fit the length as the evaluation metric. The top 20 retrieved songs were
of FFT (2048 point). From FFT result the instantaneous analyzed by default.
frequencies are extracted and Chroma is calculated. From First we trained the weights used in the FF sets by us-
the amplitude spectrum pitch, MFCC and Mel-magnitude ing the Trains80 dataset as the ground truth. We selected
are calculated. Then the summary is calculated from all 40 pairs of similar songs (each pair includes two versions of
frames. the same song) and 40 pairs of non-similar songs (one pair
The ground truth is set up according to human perception. includes two different songs). We selected some long term
We have listened to all the songs and manually labeled them features as the final semantic audio representation according
so that retrieval results of our algorithms correspond to hu- to their importance. Figure 4 compares MRR(1) achieved
man perception to support practical application. Trains80 by different features utilizing exhaustive KNN search. With
and Covers79 datasets were divided into groups according to each feature, 993 queries were performed against 2968 tracks.
their verse (the main theme represented by the song lyrics) FF has the best performance in comparison with other sep-
to judge whether songs belong to the same group or not (one arate competitive feature sets. Moreover it also reveals that
group represents one song and different versions of one song FF set can represent human perception more effectively.
are members in this group). The 30s segments in these two However, MRR(1) of FF is only 0.823. It can not reach
datasets are extracted from verse sections of tracks. 1 even by the exhaustive search due to the following factor:
The experiments were run on a PC with an Intel Core2 some of the perceptually similar tracks have quite different
CPU (2GHz) and 1GB DRAM under Microsoft Windows features and result in quite large distances in feature space.
XP Professional OS. Figure 5 shows MRR(1) of the three schemes with respect
to different numbers of hash instances. KNN always per-
4.2 Performance Evaluation forms best and E2 LSH always has the worst performance.
Our task is to detect the musical audio content with multi- When there are only two hash instances, LSH performs as
variant tracks. In this section we provide some experimental poorly as E2 LSH. However LSH outperforms E2 LSH when
evaluation results. The whole evaluation was run over 2968 the number of hash instances is greater than 3. MRR(1) in-
(79+1431+1458) audio tracks, where the original track of creases as the number of hash instances does. However this
each song in Cover79 were put in the dataset. The rest increase is not linear. Their performances start approaching
(1072-79=993) tracks in Cover79 were used as queries. Mean steadiness when the number of hash instances increases to
reciprocal rank (MRR) of the ground truth were calculated 10. Further increase in hash instances results in diminishing
track selected as query, the system can give a ranked list of
KNN E2LSH LSH the retrieval result. The first eight relevant audio tracks are
1
reported automatically with the best similarity audio track
as the first. Both queries and retrieval results can be played
0.8
to confirm the correctness of searching.
This system not only retrieves relevant audio tracks from
MRR(1)
0.6 the datasets, but also evaluates the performance of our ap-
proaches. In the current setting in Figure 7 Feature “FF”,
0.4 Similarity Searching “LSH” and Query Group ID “5” are set
up respectively. In “Ranked List”, the retrieval results are
0.2 ordered and the first eight relevant audio tracks are given in
the list. The corresponding evaluation result is also given at
0 the right side. We can easily see that with the third variant
79 79+1431 79+1431+1458 of the song in Group 5 as query input after searching over
the datasets its relevant audio track appears in the second
position (MRR(1) is 0.5).
Figure 6: MRR at different database sizes.
5. CONCLUSION
With more and more personal music recordings available
via the WWW, there is an increasing demand in develop-
ing tools for detecting multi-variant audio music recordings.
The goal of such retrieval tools is to rank a collection of
music recordings according to their similarity. To help pro-
cess multi-variant audio tracks, musical audio feature ex-
traction plays an important role in audio content retrieval
and searching. A good audio feature not only needs to ef-
fectively represent musical sequence characteristics but also
is easily mapped to an indexable format. In this paper we
introduced an index-based semantic framework for speeding
up the content-based audio retrieval. We proposed a seman-
tic regression model to train a meaningful semantic audio
summarization called Federated Features (FF). Three simi-
larity searching schemes were adopted to test our proposed
approach to musical audio representation. The experimen-
tal results demonstrate that our weighting scheme is useful
for audio content detection over the large databases. We
also developed a retrieval system to not only retrieve audio
content with a batch of multi-variant audio tracks, but also
Figure 7: Searching with multi-variant audio con- evaluate the performance of searching schemes.
tent.
6. ACKNOWLEDGMENTS
returns. This is because feature summarization results in We thank Initiative Project of Nara Women’s Unveirsity
information loss which can not be salvaged by the increase for offering an opportunity to study abroad. This work was
of hash instances. Therefore the number of hash table is set partly discussed and done when Yi visited International Mu-
to 10 in the following experiment. sic Information Retrieval System Evaluation Laboratory (IM
Figure 6 shows the effect of different database sizes. The IRSEL) in summer, 2007. The second author is a founder of
three databases respectively contain Cover79 (79), Cover79 IMIRSEL at UIUC, and was supported by the Andrew W.
& RADIO (79+1431), Cover79 & RADIO & ISMIR (79+1431 Mellon and national Science Foundation (NSF) under Nos.
+1458). It is obvious that the increase of database size has IIS-0340597 IIS-0327371. The fourth author was partially
very little effect on the MRR(1), which confirms that the FF supported by a grant from DoD-ARL through the KIMCOE
set can effectively distinguish the (newly added) non-similar center of Excellence.
tracks.
7. REFERENCES
4.3 Searching Datasets with Multi-Variant Au- [1] J. S. Downie. The Music Information Retrieval Evalu-
dio Tracks ation eXchange (MIREX). In D-Lib Magazine 12, 2006.
Figure 7 illustrates our content-based music information http://dlib.org/dlib/december06/downie/12downie.html.
retrieval demo system developed in C++. From the left [2] J. P. Bello. Audio-based Cover Song Retrieval Using
side, the features and similarity searching methods can be Approximate Chord Sequences: Testing Shifts, Gaps,
selected. In this demo system, the query set consists of 79 Swaps and Beats. ISMIR’07, pp.239-244, 2007.
groups with 1072-79=993 audio tracks (the original track of [3] D. Ellis and G. Poliner. Identifying cover songs with
each song is in the datasets). Any group ID can be freely chroma features and dynamic programming beat
specified to show the query audio tracks. With any audio tracking. ICASSP’07, 2007.
[4] Y. Yu, K. Joe, and J. S. Downie. Efficient Query-by- [11] LSH Algorithm and Implementation (E2 LSH)
Content Audio Retrieval by Locality Sensitive Hashing http://web.mit.edu/andoni/www/LSH/index.html.
and Partial Sequence Comparison. IEICE Transaction [12] P. Indyk and N. Thaper. Fast color image retrieval via
on Information and System, Vol.E91-D, No.6, embeddings. Workshop on Statistical and
pp1730-1739, 2008. Computational Theories of Vision (ICCV), 2003.
[5] Y. Yu, J. S. Downie, and K. Joe. An Evaluation of [13] S.Y. Hu. Efficient Video Retrieval by Locality
Feature Extraction for Query-by-Content Audio Sensitive Hashing. ICASSP’05, pp.449-452, 2005.
Information Retrieval. Ninth IEEE International [14] J. Reiss, J. J. Aucouturier, and M. Sandler. Efficient
Symposium on Multimedia Workshops (ISMW), pp. multi dimensional searching routines for music
297-302, 2007. information retrieval. ISMIR’01, 2001.
[6] Y. Yu, M. Takata, and K. Joe. Index-Based Similarity [15] I. Karydis, A. Nanopoulos, A. N. Papadopoulos and
Searching with Partial Sequence Comparison for Y. Manolopoulos. Audio Indexing for Efficient Music
Query-by-Content Audio Retrieval. Workshop on Information Retrieval. MMM’05, pp.22-29, 2005.
Learning Semantics of Audio Signals (LSAS’06), [16] M. Casey and M. Slaney. Song Intersection by
pp.76-86, 2006. Approximate Nearest Neighbor Search. ISMIR’06,
[7] F.Moerchen, I. Mierswa, and A. Ultsch. pp.144-149, 2006.
Understandable Models of Music Collection based on [17] M. Lesaffre and M. Leman. Using Fuzzy to Handle
Exhaustive Feature Generation with Temporal Semantic Descriptions of Music in a Content-based
Statistics. KDD’06, pp.882-891, 2006. Retrieval System. Workshop on Learning Semantics of
[8] C.Yang. Efficient Acoustic Index for Music Retrieval Audio Signals (LSAS’06), pp.43-5, 2006.
with Various Degrees of Similarity. ACM Multimedia, [18] G. Tzanetakis and P. Cook. Musical Genre
pp.584-591, 2002. Classification of Audio Signals. IEEE Transactions on
[9] B. Cui, J.L. Shen, G. Cong, H.T. Shen, and C. Yu. Speech and Audio Processing, Vol.10, No.5, pp.
Exploring Composite Acoustic Features for Efficient 293-302, 2002.
Music Similarity Query. ACM MM’06, pp.634-642, [19] R. Miotto and N. Orio. A Methodology for the
2006. Segmentation and Identification of Music Works.
[10] T. Pohle, M. Schedl, P. Knees, and G. Widmer. ISMIR’07, pp.239-244, 2007.
Automatically Adapting the Structure of Audio [20] L.Rabiner and B.-H. Juang. Fundamentals of Speech
Similarity Spaces. Workshop on Learning Semantics of Recognition. Prentice-Hall, 1993.
Audio Signals (LSAS’06), pp.66-75, 2006.