Clustering by Compression: Abstract-We Present A New Method For Clustering Based On Compression
Clustering by Compression: Abstract-We Present A New Method For Clustering Based On Compression
Clustering by Compression: Abstract-We Present A New Method For Clustering Based On Compression
Clustering by Compression its universality and robustness by experiments and results in a plethora
of different areas using different types of compressors.
Rudi Cilibrasi and Paul M. B. Vitányi Feature-Based Similarities: We are presented with unknown data
and the question is to determine the similarities among them and group
Abstract—We present a new method for clustering based on compression. like with like together. Commonly, the data are of a certain type: music
The method does not use subject-specific features or background knowl- files, transaction records of ATM machines, credit card applications,
edge, and works as follows: First, we determine a parameter-free, universal, genomic data. In these data, there are hidden relations that we would
similarity distance, the normalized compression distance or NCD, com- like to get out into the open. For example, from genomic data one can
puted from the lengths of compressed data files (singly and in pairwise con- extract letter or block frequencies (the blocks are over the four-letter
catenation). Second, we apply a hierarchical clustering method. The NCD
is not restricted to a specific application area, and works across applica- alphabet); from music files one can extract various specific numerical
tion area boundaries. A theoretical precursor, the normalized information features, related to pitch, rhythm, harmony, etc. One can extract such
distance, co-developed by one of the authors, is provably optimal. How- features using, for instance, Fourier transforms [43] or wavelet trans-
ever, the optimality comes at the price of using the noncomputable notion forms [17], to quantify parameters expressing similarity. The resulting
of Kolmogorov complexity. We propose axioms to capture the real-world
setting, and show that the NCD approximates optimality. To extract a hi-
vectors corresponding to the various files are then classified or clus-
erarchy of clusters from the distance matrix, we determine a dendrogram tered using existing classification software, based on various standard
(ternary tree) by a new quartet method and a fast heuristic to implement it. statistical pattern recognition classifiers [43], Bayesian classifiers [15],
The method is implemented and available as public software, and is robust hidden Markov models [13], ensembles of nearest neighbor classifiers
under choice of different compressors. To substantiate our claims of univer- [17], or neural networks [15], [39]. For example, in music, one fea-
sality and robustness, we report evidence of successful application in areas
as diverse as genomics, virology, languages, literature, music, handwritten ture would be to look for rhythm in the sense of beats per minute. One
digits, astronomy, and combinations of objects from completely different can make a histogram where each histogram bin corresponds to a par-
domains, using statistical, dictionary, and block sorting compressors. In ticular tempo in beats-per-minute and the associated peak shows how
genomics, we presented new evidence for major questions in Mammalian frequent and strong that particular periodicity was over the entire piece.
evolution, based on whole-mitochondrial genomic analysis: the Eutherian
orders and the Marsupionta hypothesis against the Theria hypothesis.
In [43], we see a gradual change from a few high peaks to many low
and spread-out ones going from hip-hip, rock, jazz, to classical. One
Index Terms—Heterogenous data analysis, hierarchical unsupervised can use this similarity type to try to cluster pieces in these categories.
clustering, Kolmogorov complexity, normalized compression distance,
parameter-free data mining, quartet tree method, universal dissimilarity
However, such a method requires specific and detailed knowledge of
distance. the problem area, since one needs to know what features to look for.
Non-Feature Similarities: Our aim is to capture, in a single simi-
larity metric, every effective distance: effective versions of Hamming
I. INTRODUCTION distance, Euclidean distance, edit distances, alignment distance,
All data are created equal but some data are more alike than others. Lempel–Ziv distance [11], and so on. This metric should be so general
We propose a method expressing this alikeness, using a new similarity that it works in every domain: music, text, literature, programs,
metric based on compression. It is parameter-free in that it does not genomes, executables, natural language determination, equally and si-
use any features or background knowledge about the data, and can multaneously. It would be able to simultaneously detect all similarities
without changes be applied to different areas and across area bound- between pieces that other effective distances can detect seperately.
aries. It is universal in that it approximates the parameter expressing Compression-Based Similarity: Such a “universal” metric was
similarity of the dominant feature in all pairwise comparisons. It is ro- co-developed by us in [29]–[31], as a normalized version of the “infor-
bust in the sense that its success appears independent from the type mation metric” of [4], [32]. Roughly speaking, two objects are deemed
of compressor used. The clustering we use is hierarchical clustering close if we can significantly “compress” one given the information in
in dendrograms based on a new fast heuristic for the quartet method. the other, the idea being that if two pieces are more similar, then we
The method is available as an open-source software tool. Below we ex- can more succinctly describe one given the other. The mathematics
plain the method, the theory underpinning it, and present evidence for used is based on Kolmogorov complexity theory [32]. In [31], we
defined a new class of (possibly nonmetric) distances, taking values
in [0; 1] and appropriate for measuring effective similarity relations
Manuscript received March 31, 2003; revised December 29, 2004. The mate- between sequences, say one type of similarity per distance, and vice
rial in this correspondence was presented in part at the IEEE International Sym-
posium on Information Theory, Yokohama, Japan, June/July 2003. The work of versa. It was shown that an appropriately “normalized” information
R. Cilibrasi was supported in part by The Netherlands BSIK/BRICKS Project distance minorizes every distance in the class. It discovers all effective
and by the NWO Project 612.55.002. The work of P. M. B. Vitányi was sup- similarities in the sense that if two objects are close according to some
ported in part by the EU Project RESQ, IST-2001-37559, the NoE QUIPRO- effective similarity, then they are also close according to the normal-
CONE IST-1999-29064, the ESF QiT Programmme, and the EU NoE PASCAL,
The Netherlands BSIK/BRICKS Project, and the KRR and SML&KA Programs
ized information distance. Put differently, the normalized information
of National ICT of Australia. Part of his work was performed while P. M. B. distance represents similarity according to the dominating shared
Vitányi was on sabbatical leave at the National ICT of Australia, Sydney Labo- feature between the two objects being compared. In comparisons of
ratory at the University of New South Wales. more than two objects, different pairs may have different dominating
R. Cilibrasi is with the Centre for Mathematics and Computer Science (CWI),
features. The normalized information distance is a metric and takes
P. O. Box 94079, NL-1090 GB Amsterdam, The Netherlands (e-mail: Rudi.Cili-
[email protected]). values in [0; 1]; hence, it may be called the similarity metric. To
P. M. B. Vitányi is with the Centre for Mathematics and Computer Sci- apply this ideal precise mathematical theory in real life, we have to
ence (CWI), P. O. Box 94079, NL-1090 GB Amsterdam, The Netherlands replace the use of the noncomputable Kolmogorov complexity by
(e-mail:[email protected]). He is also affiliated with the University of an approximation using a standard real-world compressor. Earlier
Amsterdam and National ICT of Australia.
Communicated by J. C. Kieffer, W. Szpankowski, and E.-h. Yang, Guest Ed- approaches resulted in the first completely automatic construction of
itors for the Special Issue on Problems on Sequences. the phylogeny tree based on whole mitochondrial genomes, [29]–[31],
Digital Object Identifier 10.1109/TIT.2005.844059 a completely automatic construction of a language tree for over 50
Euro-Asian languages [31], detects plagiarism in student programming well the resulting tree represents the information in the distance ma-
assignments [8], gives phylogeny of chain letters [5], and clusters trix on a scale of 0 to 1. Then, as proof of principle, we run the pro-
music [10]. Moreover, the method turns out to be robust under change gram on three data sets, where we know what the final answer should
of the underlying compressor-types: statistical (PPMZ), Lempel–Ziv be: i) reconstruct a tree from a distance matrix obtained from a ran-
based dictionary (gzip), block based (bzip2), or special purpose domly generated tree; ii) reconstruct a tree from files containing arti-
(Gencompress). ficial similarities; and iii) reconstruct a tree from natural files of het-
Related Work: In view of the simplicity and naturalness of our erogenous data of vastly different types. To substantiate our claim of
proposal, it is perhaps surprising that compression-based clustering parameter-freeness and universality, we apply the method to different
and classification approaches did not arise before. But recently there areas, not using any feature analysis at all. We first give an example
have been several partially independent proposals in that direction: [1], in whole-genome phylogeny using the whole mitochondrial DNA of
[2] for author attribution and building language trees—while citing the the species concerned. We compare the hierarchical clustering of our
earlier work [4], [32]—does not develop a theory based on information method with a more standard method of two-dimensional clustering (to
distance but proceeds by more ad hoc arguments related to the com- show that our dendrogram method of depicting the clusters is more in-
pressibility of a target file after first compressing a reference file. The formative). We give a whole-genome phylogeny of fungi and compare
better the target file compresses, the more we feel it is similar to the this to results using alignment of selected proteins (alignment being
reference file in question. See also the explanation in [31, Appendix I]. often too costly to perform on the whole-mitochondial genome, but the
This approach is used also to cluster music MIDI (Musical Instrument disadvantage of protein selection being that different selections usually
Digital Interface, a versatile digital music format available on the result in different phylogenies—so which is right?). We identify the
World-Wide Web) files by Kohonen maps in [33]. Another recent virii that are closest to the sequenced SARS virus; we give an example
offshoot based on our work is hierarchical clustering based on mutual of clustering of language families; Russian authors in the original Rus-
information, [23]. In a related, but considerably simpler feature-based sian, the same pieces in English translation (clustering partially follows
approach, one can compare the word frequencies in text files to assess the translators); clustering of music in MIDI format; clustering of hand-
similarity. In [42], the word frequencies of words common to a pair of written digits used for optical character recognition; and clustering of
text files are used as entries in two vectors, and the similarity of the radio observations of a mysterious astronomical object, a microquasar
two files is based on the distance between those vectors. The authors of extremely complex variability. In all these cases, the method per-
attribute authorship to Shakespeare plays, the Federalist Papers, and forms very well in the following sense: The method yields the phy-
the Chinese classic “The Dream of the Red Chamber.” This approach logeny of 24 species agreeing with biological wisdom insofar as it is un-
based on block occurrence statistics is standard in genomics, but in an controversial. The probability that it randomly would hit this outcome,
experiment reported in [31] gives inferior phylogeny trees compared or anything reasonably close, is very small. In clustering 36 music
to our compression method (and wrong ones according to current pieces taken equally from pop, jazz, and classic, so that 12–12–12 is
biological wisdom). A related, opposite, approach was taken in [22], the grouping we understand is correct, we can identify convex clusters
where literary texts are clustered by author gender or fact versus so that only six errors are made. (That is, if three items get dislodged
fiction, essentially by first identifying distinguishing features, like without two of them being interchanged, then six items get misplaced.)
gender-dependent word usage, and then classifying according to those The probability that this happens by chance is extremely small. The
features. Apart from the experiments reported here, the clustering reason why we think the method does something remarkable is con-
by compression method reported in this correspondence has recently cisely put by Laplace [28]:
been used to analyze network traffic and cluster computer worms and
“If we seek a cause wherever we perceive symmetry, it is not that
virusses [44]. Finally, recent work [20] reports experiments with our
we regard the symmetrical event as less possible than the others,
method on all time sequence data used in all the major data-mining
but, since this event ought to be the effect of a regular cause or
conferences in the last decade. Comparing the compression method
that of chance, the first of these suppositions is more probable
with all major methods used in those conferences they established clear
than the second. On a table we see letters arranged in this order
superiority of the compression method for clustering heterogenous
Constantinople, and we judge that this arrangement is not
data, and for anomaly detection. See also the explanation in [31,
the result of chance, not because it is less possible than others,
Appendix II].
for if this word were not employed in any language we would not
Outline: Here we propose a first comprehensive theory of real-world
suspect it came from any particular cause, but this word being in
compressor-based normalized compression distance, a novel hierar-
use among us, it is incomparably more probable that some person
chical clustering heuristic, together with several applications. First, we
has thus arranged the aforesaid letters than that this arrangement
propose mathematical notions of “admissible distance” (using the term
is due to chance.”
for a wider class than we did in [31]), “normalized admissible distance”
or “similarity distance,” “normal compressor,” and “normalized com- Materials and Methods: The data samples we used were obtained
pression distance.” We then prove the normalized compression distance from standard databases accessible on the World-Wide Web, generated
based on a normal compressor to be a similarity distance satisfying the by ourselves, or obtained from research groups in the field of inves-
metric (in)equalities. The normalized compression distance is shown tigation. We supply the details with each experiment. The method of
to be quasi-universal in the sense that it minorizes every computable processing the data was the same in all experiments. First, we prepro-
similarity distance up to an error that depends on the quality of the cessed the data samples to bring them in appropriate format: the ge-
compressor’s approximation of the true Kolmogorov complexities of nomic material over the four-letter alphabet fA; T ; G; C g is recoded
the files concerned. This means that the NCD captures the dominant in a four-letter alphabet; the music MIDI files are stripped of iden-
similarity over all possible features for every pair of objects compared, tifying information such as composer and name of the music piece.
up to the stated precision. Note that different pairs of objects may have Then, in all cases, the data samples were completely automatically pro-
different dominant shared features. Next, we present a method of hi- cessed by our CompLearn Toolkit, rather than as is usual in phylogeny,
erarchical clustering based on a novel fast randomized hill-climbing by using an ecclectic set of software tools per experiment. Oblivious
heuristic of a new quartet tree optimization criterion. Given a matrix to the problem area concerned, simply using the distances according
of the pairwise similarity distances between the objects, we score how to the NCD below, the method described in this correspondence fully
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 51, NO. 4, APRIL 2005 1525
automatically classifies the objects concerned. The method has been with phylogeny hypotheses, the clustering can give true support of
released into the public domain as open-source software. The Com- one hypothesis against another one.
pLearn Toolkit [9] is a suite of simple utilities that one can use to apply Figures: We use two styles to display the hierarchical clusters. In
compression techniques to the process of discovering and learning pat- the case of genomics of Eutherian orders and fungi, language trees,
terns in completely different domains. In fact, this method is so general it is convenient to follow the dendrograms that are customary in
that it requires no background knowledge about any particular subject that area (suggesting temporal evolution) for easy comparison with
area. There are no domain-specific parameters to set, and only a handful the literature. Although there is no temporal relation intended, the
of general settings. dendrogram representation looked also appropriate for the Russian
The CompLearn Toolkit using NCD and not, say, alignment, can writers, and translations of Russian writers. In the other experiments
cope with full genomes and other large data files and thus comes up (even the genomic SARS experiment) it is more informative to
with a single distance matrix. The clustering heuristic generates a tree display an unrooted ternary tree (or binary tree if we think about
with a certain confidence, called standardized benefit score or S (T ) incoming and outgoing edges) with explicit internal nodes. This
value in the sequel. Generating trees from the same distance matrix facilitates identification of clusters in terms of subtrees rooted at
many times resulted in the same tree in case of high S (T ) value, or internal nodes or contiguous sets of subtrees rooted at branches of
a similar tree in case of moderately high S (T ) value, for all distance internal nodes.
matrices we used, even though the heuristic is randomized. That is,
there is only one way to be right, but increasingly many ways to be
increasingly wrong which can all be realized by different runs of the II. SIMILARITY DISTANCE
randomized algorithm. This is a big difference compared to previous We give a precise formal meaning to the loose distance notion of
phylogeny methods, where because of computational limitations one “degree of similarity” used in the pattern recognition literature.
uses only parts of the genome, or certain proteins that are viewed as
significant [21]. These are run through a tree reconstruction method A. Distance and Metric
like neighbor joining [38], maximum likelihood, maximum evolution,
maximum parsimony as in [21], or quartet hypercleaning [6], many Let
be a nonempty set and R+ be the set of nonnegative real
times. The percentage-wise agreement on certain branches arising are numbers. A distance function on
is a function D :
2
! R+ . It
called “bootstrap values.” Trees are depicted with the best bootstrap is a metric if it satisfies the following metric (in)equalities:
values on the branches that are viewed as supporting the theory tested. • D(x; y ) = 0 iff x = y ,
Different choices of proteins result in different best trees. One way • D(x; y ) = D(y; x) (symmetry), and
to avoid this ambiguity is to use the full genome, [31], [36], leading • D(x; y ) D(x; z ) + D(z; y ) (triangle inequality).
to whole-genome phylogeny. With our method we do whole-genome The value D(x; y ) is called the distance between x; y 2
. A familiar
phylogeny, and end up with a single overall best tree, not optimizing example of a distance that is also metric is the Euclidean metric,
the everyday distance e(a; b) between two geographical objects a, b
selected parts of it.
The quality of the results depends on a) the NCD distance matrix,
expressed in, say, meters. Clearly, this distance satisfies the properties
e(a; a) = 0, e(a; b) = e(b; a), and e(a; b) e(a; c) + e(c; b) (for
and b) on how well the hierarchical tree represents the information in
the matrix. The quality of b) is measured by the S (T ) value, and is
instance, a = Amsterdam, b = Brussels, and c = Chicago). We are
given with each experiment. In general, the S (T ) value deteriorates
interested in a particular type of distance, the “similarity distance,”
for large sets. We believe this to be partially an artifact of a low-
which we formally define in Definition 2.5. For example, if the
resolution NCD matrix due to limited compression power, and limited
objects are classical music pieces then the function D defined by
file size. The main reason, however, is the fact that with increasing
D (a; b) = 0 if a and b are by the same composer and D (a; b) =
size of a natural data set the projection of the information in the
NCD matrix into a binary tree unavoidably gets increasingly distorted. 1 otherwise, is a similarity distance that is also a metric. This
Another aspect limiting the quality of the NCD matrix is more subtle. metric captures only one similarity aspect (feature) of music pieces,
Recall that the method knows nothing about any of the areas we apply presumably an important one that subsumes a conglomerate of more
it to. It determines the dominant feature as seen through the NCD elementary features.
filter. The dominant feature of alikeness between two files may not
correspond to our a priori assumption but may have an unexpected B. Admissible Distance
cause. The results of our experiments suggest that this is not often In defining a class of admissible distances (not necessarily metric
the case: In the natural data sets where we have preconceptions of the distances) we want to exclude unrealistic ones like f (x; y ) = 12 for
outcome, for example, that works by the same authors should cluster every pair x 6= y . We do this by restricting the number of objects within
together, or music pieces by the same composers, musical genres, a given distance of an object. As in [4], we do this by only considering
or genomes, the outcomes conform largely to our expectations. For effective distances, as follows: Fix a suitable, and for the remainder of
example, in the music genre experiment, the method would fail
the paper, fixed, programming language. This is the reference program-
dramatically if genres were evenly mixed, or mixed with little bias.
ming language.
However, to the contrary, the separation in clusters is almost perfect.
The few misplacements that are discernable are either errors (the Definition 2.1: Let
= 63 , with 6 a finite nonempty alphabet
method was not powerful enough to discern the dominant feature), and 63 the set of finite strings over that alphabet. Since every finite
the distortion due to mapping multidimensional distances into tree alphabet can be recoded in binary, we choose 6 = f0; 1g. In partic-
distances, or the dominant feature between a pair of music pieces ular, “files” in computer memory are finite binary strings. A function
is not the genre but some other aspect. The surprising news is D :
2
! R
+ is an admissible distance if for every pair of objects
that we can generally confirm expectations with few misplacements, x; y 2
the distance D (x; y ) is the length of a binary prefix codeword
indeed, that the data do not contain unknown rogue features that that is a program that computes x from y , and vice versa, in the refer-
dominate to cause spurious (in our preconceived idea) clustering. ence programming language. This implies that an admissible distance
This gives evidence that where the preconception is in doubt, like is symmetric.
1526 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 51, NO. 4, APRIL 2005
If D is an admissible distance, then for every x, the set fD(x; y ) : It follows from the definitions that a normalized admissible distance
y 2 f0; 1g3 g is the length set of a prefix code. Hence, it satisfies by is a function d :
2
! [0; 1] that is symmetric: d(x; y ) = d(y; x).
Lemma 2.6: For every x 2
, and constant e 2 [0; 1], a normalized
the Kraft inequality, see [12],
20D(x;y) 1: (II.1)
admissible distance satisfies the density constraint
z = x or z = y ). By (II.2), for every x, the number of y in the Ham- This number C (y jx) of bits of information in y , relative to x, can be
ming ball hn (x; y ) e is less than 2eH (x)+1 . This upper bound is viewed as the excess number of bits in the compressed version of xy
an obvious overestimate for e 1= log n. For lower values of e, the compared to the compressed version of x, and is called the amount of
upper bound is correct by the observation that the number of y ’s equals conditional compressed information.
en n
In the definition of compressor, the decompression algorithm is not
en
2nH e( )
included (unlike the case of Kolmorogov complexity, where the decom-
i=0 pressing algorithm is given by definition), but it is easy to construct one.
Given the compressed version of x in C (x) bits, we can run the com-
where H (e) = e log e+(10e) log(10e), Shannon’s entropy function. pressor on all candidate strings z —for example, in length-increasing
Then, eHn+ (x) > en log n > enH (e) since e log n > H (e). lexicographical order, until we find the compressed version of string
z0 equals the compressed version of string x. Since this string decom-
III. NORMAL COMPRESSOR presses to x we have found x = z0 . Given the compressed version of
We give axioms determining a large family of compressors that both xy in C (xy) bits, we repeat this process using strings xz until we find
include most (if not all) real-world compressors and ensure the desired the string xz1 of which the compressed version equals the compressed
properties of the NCD to be defined later. version of xy . Since the former compressed version decompresses to
xy , we have found y = z1 . By the unique decompression property we
Definition 3.1: A compressor C is normal if it satisfies, up to an find that C (y jx) is the extra number of bits we require to describe y
additive O(log n) term, with n the maximal binary length of an element apart from describing x. It is intuitively acceptable that the conditional
of
involved in the (in)equality concerned, the following axioms. compressed information C (xjy ) satisfies the triangle inequality
1) Idempotency: C (xx) = C (x), and C () = 0, where is the
empty string.
C ( x jy ) C ( x jz ) + C ( z jy ) : (III.4)
2) Monotonicity: C (xy ) C (x).
3) Symmetry: C (xy ) = C (yx). Lemma 3.3: Both (III.1) and (III.4) imply (III.2).
4) Distributivity: C (xy ) + C (z ) C (xz ) + C (yz ). Proof: (Equation (III.1) implies (III.2):) By monotonicity.
(Equation (III.4) implies (III.2):) Rewrite the terms in (III.4) ac-
Idempotency: A reasonable compressor will see exact repetitions cording to (III.3), cancel C (y ) in the left- and right-hand sides, use
and obey idempotency up to the required precision. It will also com- symmetry, and rearrange.
press the empty string to the empty string.
Monotonicity: A real compressor must have the monotonicity prop- Lemma 3.4: A normal compressor satisfies additionally subaddi-
erty, at least up to the required precision. The property is evident for tivity: C (xy ) C (x) + C (y ).
stream-based compressors, and only slightly less evident for block- Proof: Consider the special case of distributivity with z the
coding compressors. empty word so that xz = x, yz = y , and C (z ) = 0.
Symmetry: Stream-based compressors of the Lempel–Ziv family, Subadditivity: The subadditivity property is clearly also required
like gzip and pkzip, and the predictive PPM family, like PPMZ, are pos- for every viable compressor, since a compressor may use information
sibly not precisely symmetric. This is related to the stream-based prop- acquired from x to compress y . Minor imprecision may arise from the
erty: the initial file x may have regularities to which the compressor unlearning effect of crossing the border between x and y , mentioned
adapts; after crossing the border to y it must unlearn those regularities in relation to symmetry, but again this must vanish asymptotically with
and adapt to the ones of y . This process may cause some imprecision increasing length of x, y .
in symmetry that vanishes asymptotically with the length of x, y . A
compressor must be poor indeed (and will certainly not be used to any IV. BACKGROUND IN KOLMOGOROV COMPLEXITY
extent) if it does not satisfy symmetry up to the required precision.
Apart from stream-based, the other major family of compressors is Technically, the Kolmogorov complexity of x given y is the length of
block-coding based, like bzip2. They essentially analyze the full input the shortest binary program, for the reference universal prefix Turing
block by considering all rotations in obtaining the compressed version. machine, that on input y outputs x; it is denoted as K (xjy ). For precise
It is to a great extent symmetrical, and real experiments show no de- definitions, theory, and applications, see [32]. The Kolmogorov com-
parture from symmetry. plexity of x is the length of the shortest binary program with no input
Distributivity: The distributivity property is not immediately in- that outputs x; it is denoted as K (x) = K (xj), where denotes the
tuitive. In Kolmogorov complexity theory the stronger distributivity empty input. Essentially, the Kolmogorov complexity of a file is the
property length of the ultimate compressed version of the file. In [4], the infor-
mation distance E (x; y ) was introduced, defined as the length of the
C (xyz ) + C (z ) C (xz ) + C (yz ) (III.1) shortest binary program for the reference universal prefix Turing ma-
chine that, with input x, computes y , and with input y , computes x. It
holds (with K = C ). However, to prove the desired properties of NCD was shown there that, up to an additive logarithmic term,
E (x; y) = maxfK (xjy); K (yjx)g:
below, only the weaker distributivity property
C (xy) + C (z ) C (xz ) + C (yz ) (III.2) It was shown also that E (x; y ) is a metric, up to negligible violations
of the metric inequalties. Moreover, it is universal in the sense that
above is required, also for the boundary case were C = K . In prac- for every admissible distance D(x; y ) as in Definition 2.1, E (x; y )
tice, real-world compressors appear to satisfy this weaker distributivity D(x; y) up to an additive constant depending on D but not on x and
property up to the required precision. y . In [31], the normalized version of E (x; y), called the normalized
Definition 3.2: Define information distance (NID), is defined as
NID(x; y ) =
maxfK (xjy); K (yjx)g :
C (yjx) = C (xy) 0 C (x): (III.3) maxfK (x); K (y)g (IV.1)
1528 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 51, NO. 4, APRIL 2005
It too is a metric, and it is universal in the sense that this single metric Case 2: Assume C(y) C(x). By symmetry C(xy) = C(yx).
minorizes up to a negligible additive error term all normalized admis- Now follow the proof of Case 1.
sible distances in the class considered in [31]. Thus, if two files (of what-
Lemma 5.2: If C is a normal compressor, then EC (x; y) satisfies
ever type) are similar (that is, close) according to the particular feature
the metric (in)equalities up to logarithmic additive precision.
described by a particular normalized admissible distance (not neces-
Proof: Only the triangular inequality is nonobvious. By (III.2),
C(xy) + C(z) C(xz) + C(yz) up to logarithmic additive preci-
sarily metric), then they are also similar (that is, close) in the sense of
the normalized information metric. This justifies calling the latter the
sion. There are six possibilities, and we verify the correctness of the
similarity metric. We stress once more that different pairs of objects
triangular inequality in turn for each of them.
Assume C(x) C(y) C(z). Then
may have different dominating features. Yet every such dominant sim-
ilarity is detected by the NID. However, this metric is based on the no-
tion of Kolmogorov complexity. Unfortunately, the Kolmogorov com-
plexity is noncomputable in the Turing sense. Approximation of the de- C(xy) 0 C(x) C(xz) 0 C(x) + C(yz) 0 C(y):
nominator of (IV.1) by a given compressor C is straightforward: it is
maxfC(x); C(y)g. The numerator is more tricky. It can be rewritten as Assume C(y) C(x) C(z). Then
f
max K(x; y) 0 K(x); K(x; y) 0 K(y)g (IV.2)
C(xy) 0 C(y) C(xz) 0 C(y) + C(yz) 0 C(x):
within logarithmic additive precision, by the additive property of
Kolmogorov complexity [32]. The term K(x; y) represents the length
of the shortest program for the pair (x; y). In compression practice
Assume C(x) C(z) C(y). Then
it is easier to deal with the concatenation xy or yx. Again, within
logarithmic precision K(x; y) = K(xy) = K(yx). Following a
C(xy) 0 C(x) C(xz) 0 C(x) + C(yz) 0 C(z):
C(z) C(x). Then
suggestion by Steven de Rooij, one can approximate (IV.2) best by
minfC(xy); C(yx)g 0 minfC(x); C(y)g. Here, and in the later Assume C(y)
experiments using the CompLearn Toolkit [9], we simply use C(xy)
rather than minfC(xy); C(yx)g. This is justified by the observation C(xy) 0 C(y) C(xz) 0 C(z) + C(yz) 0 C(y):
that block-coding-based compressors are symmetric almost by defini-
tion, and experiments with various stream-based compressors (gzip, Assume C(z) C(x) C(y). Then
PPMZ) show only small deviations from symmetry.
The result of approximating the NID using a real compressor C is C(xy) 0 C(x) C(xz) 0 C(z) + C(yz) 0 C(z):
called the normalized compression distance or NCD, formally defined
in (IV.1). The theory as developed for the Kolmogorov-complexity Assume C(z) C(y) C(x). Then
based NID in [31], may not hold for the (possibly poorly) approxi-
mating NCD. It is nonetheless the case that experiments show that the C(xy) 0 C(y) C(xz) 0 C(z) + C(yz) 0 C(z):
NCD apparently has (some) properties that make the NID so appealing.
To fill this gap between theory and practice, we develop the theory
of NCD from first principles, based on the axiomatics of Section III. Lemma 5.3: If C is a normal compressor, then
We show that the NCD is a quasi-universal similarity metric relative
to a normal reference compressor C . The theory developed in [31] is + f
EC (x; y) = max C(x); C(y) : g
the boundary case C = K , where the “quasi-universality” below has
become full “universality.” Proof: Consider a pair (x; y). The
V. COMPRESSION DISTANCE f
max C(xz) 0 C(z) : C(z) C(y)g
We define a compression distance based on a normal compressor
and show it is an admissible distance. In applying the approach, we is C(x) which is achieved for z = , the empty word, with C() = 0.
have to make do with an approximation based on a far less powerful Similarly, the
real-world reference compressor C . A compressor C approximates the
information distance E(x; y), based on Kolmogorov complexity, by f
max C(yz) 0 C(z) : C(z) C(x)g
the compression distance EC (x; y) defined as
EC (x; y) = C(xy) 0 minfC(x); C(y)g: (V.1) is C(y). Hence, the lemma.
Here, C(xy) denotes the compressed size of the concatenation of x
and y , C(x) denotes the compressed size of x, and C(y) denotes the VI. NORMALIZED COMPRESSION DISTANCE
compressed size of y . The normalized version of the admissible distance EC (x; y), the
compressor C based approximation of the normalized information dis-
Lemma 5.1: If C is a normal compressor, then EC (x; y) + O(1) is
tance (IV.1), is called the normalized compression distance or NCD
an admissible distance.
Proof: Case 1: Assume C(x) C(y). Then EC (x; y) =
C(xy) 0 C(x). Then, given x and a prefix program of length
C(xy) 0 minfC(x); C(y)g :
EC (x; y) consisting of the suffix of the C -compressed version of xy ,
NCD(x; y) =
f
max C(x); C(y) g (VI.1)
numbers represent more similar files. The in the upper bound is due and denominator. Although now the right-hand side
to imperfections in our compression techniques, but for most standard 1
decreases, it must still be greater than , and therefore,
01
compression algorithms one is unlikely to see an above : (in our the right-hand side remains at least as large as the left-
1
experiments, gzip and bzip2 achieved NCDs above , but PPMZ always hand side.
had NCD at most ). 1 ( )
b) NCD x; z NCD x; y ( )+ ( ) NCD y; z . By distributivity,
There is a natural interpretation to NCD(x; y ). If, say, C (y )
we have C xz( )+ ( )
C y C xy ( )+ ( ) C yz . Subtracting
C (z )
That is, the distance NCD(x; y ) between x and y is the improvement
due to compressing y using x as previously compressed “database,” The right-hand side does not decrease when we substitute
and compressing y from scratch, expressed as the ratio between the bit- () ()
C y for the denominator C z of the first term, since
wise length of the two compressed versions. Relative to the reference
compressor, we can define the information in x about y as C y 0 () () ()
C y C z . Therefore, the inequality stays valid under
( )
C yjx . Then, using (III.3)
this substitution, which was what we had to prove.
( ) ( )+ ( )
c) NCD y; z NCD y; x NCD x; z . By distributivity, we
( ) = 1 0 C (y)C0(yC)(yjx) :
NCD x; y
( )+ ( ) ( )+ ( )
have C yz C x C yx C xz . Subtracting C y ()
from both sides, using symmetry, rearranging, and dividing
That is, the NCD between x and y is 1 minus the ratio of the information
()
both sides by C z we obtain
C (yz ) 0 C (y )
C (xyC) (0z)C (x) + C (yzC) (0z)C (y) :
in x about y and the information in y .
Theorem 6.2: If the compressor is normal, then the NCD is a nor- C (z )
malized admissible distance satsifying the metric (in)equalities, that is,
The right-hand side does not decrease when we substitute
a similarity metric.
() ()
C y for the denominator C z of the first term, since
Proof: If the compressor is normal, then by Lemmas 5.1 and 5.3,
the NCD is a normalized admissible distance. It remains to show it () ()
C y C z . Therefore, the inequality stays valid under
satisfies the three metric (in)equalities. this substitution, which was what we had to prove.
1) By idempotency, we have NCD x; x ( )=0 . By monotonicity, we Quasi-universality: We now digress to the theory developed in [31],
( ) 0
have NCD x; y for every x, y , with inequality for y 6 x. = which formed the motivation for developing the NCD. If, instead of the
( )=
2) NCD x; y ( ) NCD y; x . The NCD is unchanged by inter- result of some real compressor, we substitute the Kolmogorov com-
changing x and y in (IV.1). plexity for the lengths of the compressed files in the NCD formula,
3) The difficult property is the triangle inequality. Without loss of the result is the NID as in (IV.1). It is universal in the following sense:
() () ()
generality, we assume C x C y C z . Since the NCD Every admissible distance expressing similarity according to some fea-
is symmetrical, there are only three triangle inequalities that can ture, that can be computed from the objects concerned, is comprised
( ) ( ) ( )
be expressed by NCD x; y , NCD x; z , NCD y; z . We verify (in the sense of minorized) by the NID. Note that every feature of the
them in turn. data gives rise to a similarity, and, conversely, every similarity can be
( )
a) NCD x; y NCD x; z ( )+ ( ) NCD z; y : By distributivity, thought of as expressing some feature: being similar in that sense. Our
the compressor itself satisfies actual practice in using the NCD falls short of this ideal theory in at
( ) + C (z) C (xz) + C (zy):
C xy
least three respects.
i) The claimed universality of the NID holds only for indefinitely
Subtracting C (x) from both sides and rewriting long sequences x, y . Once we consider strings x, y of defi-
C (xy ) 0 C (x) C (xz ) 0 C (x) + C (zy ) 0 C (z ):
nite length n, it is only universal with respect to “simple” com-
putable normalized admissible distances, where “simple” means
Dividing by C (y ) on both sides we find that they are computable by programs of length, say, logarithmic
C (xy ) 0 C (x)
C (xz) 0 C (xC) +(yC) (zy) 0 C (z) :
in n. This reflects the fact that, technically speaking, the univer-
C (y )
sality is achieved by summing the weighted contribution of all
similarity distances in the class considered with respect to the
The left-hand side is 1. objects considered. Only similarity distances of which the com-
i) Assume the right-hand side is 1. Setting C (z ) = plexity is small (which means that the weight is large), with re-
C (y )+1, and adding 1 to both the numerator and de- spect to the size of the data concerned, kick in.
nominator of the right-hand side, it can only increase ii) The Kolmogorov complexity is not computable, and it is in prin-
and draw closer to 1. Therefore, ciple impossible to compute how far off the NCD is from the
C (xy ) 0 C (x) NID. So we cannot in general know how well we are doing using
C (y ) the NCD.
lie between these two values. In order to be able to compare tree scores until it seems no better trees are being found in a reasonable amount of
in a more uniform way, we now rescale the score linearly such that the time, in which case the approximation is complete.
worst score maps to 0, and the best score maps to 1, and term this the Note that if a tree is ever found such that S (T ) = 1, then we can
normalized tree benefit score S (T ) = (M 0 CT )=(M 0 m). Our goal stop because we can be certain that this tree is optimal, as no tree could
1532 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 51, NO. 4, APRIL 2005
have a lower cost. In fact, this perfect tree result is achieved in our Recall, that the “similarity machine” we have described consists of
artificial tree reconstruction experiment (Section VII-A) reliably in a two parts: i) extracting a distance matrix from the data, and ii) con-
few minutes. For real-world data, S (T ) reaches a maximum somewhat structing a tree from the distance matrix using our novel quartet-based
less than 1, presumably reflecting distortion of the information in the heuristic.
distance matrix data by the best possible tree representation, as noted Testing the quartet-based tree construction: We first test whether
above, or indicating getting stuck in a local optimum or a search space the quartet-based tree construction heuristic is trustworthy. We gener-
too large to find the global optimum. On many typical problems of up ated a ternary tree T with 18 leaves, using the pseudorandom number
to 40 objects, this tree search gives a tree with S (T ) 0:9 within half generator “rand” of the Ruby programming language, and derived a
an hour. For large numbers of objects, tree scoring itself can be slow metric from it by defining the distance between two nodes as follows:
(as this takes order n4 computation steps), and the space of trees is also given the length of the path from a to b, in an integer number of edges,
large, so the algorithm may slow down substantially. For larger experi- as L(a; b), let
ments, we use a C++/Ruby implementation with message passing inter-
L(a; b) +1
face (MPI, a common standard used on massively parallel computers) d(a; b) =
18
on a cluster of workstations in parallel to find trees more rapidly. We
can consider the graph mapping the achieved S (T ) score as a function except when a = b, in which case d(a; b) = 0. It is easy to verify
of the number of trees examined. Progress occurs typically in a sig- that this simple formula always gives a number between 0 and 1, and
moidal fashion toward a maximal value 1, Fig. 3. is monotonic with path length. Given only the 18 2 18 matrix of
these normalized distances, our quartet method exactly reconstructed
the original tree T represented in Fig. 4, with S (T ) = 1.
A. Three Controlled Experiments
Testing on artificial data: Given that the tree reconstruction method
With the natural data sets we use, one may have the preconception is accurate on clean consistent data, we tried whether the full proce-
(or prejudice) that, say, music by Bach should be clustered together, dure works in an acceptable manner when we know what the outcome
music by Chopin should be clustered together, and so should music by should be like. We used the “rand” pseudorandom number generator
rock stars. However, the preprocessed music files of a piece by Bach from the C programming language standard library under Linux. We
and a piece by Chopin, or the Beatles, may resemble one another more randomly generated 11 separate 1-kB blocks of data where each byte
than two different pieces by Bach—by accident or indeed by design and was equally probable and called these tags. Each tag was associated
copying. Thus, natural data sets may have ambiguous, conflicting, or with a different lower case letter of the alphabet. Next, we generated
counterintuitive outcomes. In other words, the experiments on natural 22 files of 80 kB each, by starting with a block of purely random bytes
data sets have the drawback of not having an objective clear “correct” and applying one, two, three, or four different tags on it. Applying a tag
answer that can function as a benchmark for assessing our experimental consists of ten repetitions of picking a random location in the 80-kB
outcomes, but only intuitive or traditional preconceptions. We discuss file, and overwriting that location with the globally consistent tag that
three experiments that show that our program indeed does what it is is indicated. So, for instance, to create the file referred to in the diagram
supposed to do—at least in artificial situations where we know in ad- by “a,” we start with 80 kB of random data, then pick ten places to copy
vance what the correct answer is. over this random data with the arbitrary 1-kB sequence identified
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 51, NO. 4, APRIL 2005 1533
as tag a. Similarly, to create file “ab,” we start with 80 kB of random Bergamasque,” downloaded from various repositories on the World-
data, then pick ten places to put copies of tag a, then pick ten more Wide Web; iv) Two Linux x86 ELF executables (the cp and rm com-
places to put copies of tag b (perhaps overwriting some of the a tags). mands), copied directly from the RedHat 9.0 Linux distribution; and
Because we never use more than four different tags, and therefore never v) Two compiled Java class files, generated by ourselves. The com-
place more than 40 copies of tags, we can expect that at least half of the pressor used to compute the NCD matrix was bzip2. As expected, the
data in each file is random and uncorrelated with the rest of the files. program correctly classifies each of the different types of files together
The rest of the file is correlated with other files that also contain tags in with like near like. The result is reported in Fig. 6 with S (T ) equal
common; the more tags in common, the more related the files are. The to the very high confidence value 0:984. This experiment shows the
compressor used to compute the NCD matrix was bzip2. The resulting power and universality of the method: no features of any specific do-
tree is given in Fig. 5; it can be seen that the clustering has occured main of application are used. We believe that there is no other method
exactly as we would expect. The S (T ) score is 0:905. known that can cluster data that is so heterogenous this reliably. This
Testing on heterogenous natural data: We test gross classification is borne out by the massive experiments with the method in [20].
of files based on heterogenous data of markedly different file types:
i) Four mitochondrial gene sequences, from a black bear, polar bear,
VIII. EXPERIMENTAL VALIDATION
fox, and rat obtained from the GenBank database on the World-Wide
Web; ii) Four excerpts from the novel The Zeppelin’s Passenger by We developed the CompLearn Toolkit, Section I, and performed
E. Phillips Oppenheim, obtained from the Project Gutenberg Edition on experiments in vastly different application fields to test the quality
the World-Wide Web; iii) Four MIDI files without further processing; and universality of the method. The success of the method as reported
two from Jimi Hendrix and two movements from Debussy’s “Suite below depends strongly on the judicious use of encoding of the objects
1534 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 51, NO. 4, APRIL 2005
Fig. 5. Classification of artificial files with repeated 1-kB tags. Not all possiblities are included; for example, file “b” is missing. S (T ) = 0:905.
compared. Here one should use common sense on what a real world events (such as insertions, deletions, point mutations, rearrangements,
compressor can do. There are situations where our approach fails if and inversions) separating two sequences sharing a common ancestor
applied in a straightforward way. For example: comparing text files by will result in the loss of their shared information. Our method (in the
the same authors in different encodings (say, Unicode and 8-b version) form of the CompLearn Toolkit) is a fully automated software tool
is bound to fail. For the ideal similarity metric based on Kolmogorov based on such a distance to compare two genomes.
complexity as defined in [31] this does not matter at all, but for practical 1) Mammalian Evolution: In evolutionary biology, the timing and
compressors used in the experiments it will be fatal. Similarly, in the origin of the major extant placental clades (groups of organisms that
music experiments that follow we use the symbolic MIDI music file have evolved from a common ancestor) continues to fuel debate and
format rather than wave format music files. The reason is that the strings research. Here, we provide evidence by whole mitochondrial genome
resulting from straightforward discretizing the wave form files may be phylogeny for competing hypotheses in two main questions: the
too sensitive to how we discretize. Further research may overcome this grouping of the Eutherian orders, and the Therian hypothesis versus
problem. the Marsupionta hypothesis.
Eutherian orders: We demonstrate (already in [31]) that a whole
mitochondrial genome phylogeny of the Eutherians (placental mam-
A. Genomics and Phylogeny
mals) can be reconstructed automatically from unaligned complete
In recent years, as the complete genomes of various species become mitochondrial genomes by use of an early form of our compression
available, it has become possible to do whole genome phylogeny method, using standard software packages. As more genomic material
(this overcomes the problem that using different targeted parts of the has become available, the debate in biology has intensified concerning
genome, or proteins, may give different trees [36]). Traditional phylo- which two of the three main groups of placental mammals are more
genetic methods on individual genes depended on multiple alignment closely related: Primates, Ferungulates, and Rodents. In [7], the
of the related proteins and on the model of evolution of individual maximum-likelihood method of phylogeny tree reconstruction gave
amino acids. Neither of these is practically applicable to the genome evidence for the (Ferungulates, (Primates, Rodents)) grouping for
level. In absence of such models, a method which can compute the half of the proteins in the mitochondrial genomes investigated, and
shared information between two sequences is useful because biolog- (Rodents, (Ferungulates, Primates)) for the other halves of the mt
ical sequences encode information, and the occurrence of evolutionary genomes. In that experiment they aligned 12 concatenated mitochon-
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 51, NO. 4, APRIL 2005 1535
Fig. 6. Classification of different file types. Tree agrees exceptionally well with NCD distance matrix: S (T ) = 0:984.
drial proteins, taken from 20 species: rat (Rattus norvegicus), house references about the Eutherian orders, and the far more general one
mouse (Mus musculus), grey seal (Halichoerus grypus), harbor seal about the main orders of the mammals (Eutheria, Metatheria, and
(Phoca vitulina), cat (Felis catus), white rhino (Ceratotherium simum), Prototheria). Note also that adding the extra species from 20 to 24 is an
horse (Equus caballus), finback whale (Balaenoptera physalus), blue addition that biologists are loath to do: both for computational reasons
whale (Balaenoptera musculus), cow (Bos taurus), gibbon (Hylobates and for fear of destabilizing a realistic phylogeny by adding even one
lar), gorilla (Gorilla gorilla), human (Homo sapiens), chimpanzee more species to the computation. Furthermore, in the last mentioned
(Pan troglodytes), pygmy chimpanzee (Pan paniscus), orangutan references we used the special-purpose genome compressor GenCom-
(Pongo pygmaeus), Sumatran orangutan (Pongo pygmaeus abelii), press to determine the distance matrix, and the standard biological
using opossum (Didelphis virginiana), wallaroo (Macropus robustus), software MOLPHY package to reconstruct the phylogeny tree from
and the platypus (Ornithorhynchus anatinus) as outgroup. In [30], [31], the distance matrix. In contrast, in this paper, we conduct a larger ex-
we used the whole mitochondrial genomes of the same 20 species, periment than before, using just the general-purpose compressor bzip2
computing the NCD distances (or a closely related distance in [30]), to obtain the distance matrix, and our new quartet tree reconstruction
using the GenCompress compressor, followed by tree reconstruction method to obtain the phylogeny tree—that is, our own CompLearn
using the neighbor joining program in the MOLPHY package [38] package [9], used without any change in all the other experiments.
to confirm the commonly believed morphology-supported hypothesis Marsupionta and Theria: The extant monophyletic divisions
(Rodents, (Primates, Ferungulates)). Repeating the experiment using of the class Mammalia are the Prototheria (monotremes: mammals
the hypercleaning method [6] of phylogeny tree reconstruction gave the that procreate using eggs), Metatheria (marsupials: mammals that
same result. Here, we repeated this experiment several times using the procreate using pouches), and Eutheria (placental mammals: mam-
CompLearn Toolkit using our new quartet method for reconstructing mals that procreate using placentas). The sister relationships between
trees, and computing the NCD with various compressors (gzip, bzip2, these groups is viewed as the most fundamental question in mam-
PPMZ), again always with the same result. These experiments are not malian evolution [21]. Phylogenetic comparison by either anatomy
reported since they are subsumed by the larger experiment of Fig. 7. or mitochondrial genome has resulted in two conflicting hypotheses:
This is a far larger experiment than the one in [30], [31], and aimed the gene-isolation-supported Marsupionta hypothesis ((Prototheria,
at testing two distinct hypotheses simultaneously: the one in the latter Metatheria), Eutheria) versus the morphology-supported Theria hy-
1536 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 51, NO. 4, APRIL 2005
Fig. 7. The evolutionary tree built from complete mammalian mtDNA sequences of 24 species, using the NCD matrix of Fig. 9. We have redrawn the tree from
our output to agree better with the customary phylogeny tree format. The tree agrees exceptionally well with the NCD distance matrix: S (T ) = 0:996.
pothesis (Prototheria, (Methateria, Eutheria)), the third possibility Metatheria, and the Prototheria. These are tightly knit groups with
apparently not being held seriously by anyone. There has been a lot of relatively close NCDs. The ferungulates are a much looser group
support for either hypothesis; recent support for the Theria hypothesis with generally distant NCDs. The intergroup distances show that
was given in [21] by analyzing a large nuclear gene (M6P/IG2R), the Prototheria are furthest away from the other groups, followed by
viewed as important across the species concerned, and even more the Metatheria and the rodents. Also the fine structure of the tree is
recent support for the Marsupionta hypothesis was given in [18] by consistent with biological wisdom.
phylogenetic analysis of another sequence from the nuclear gene (18S Hierarchical versus flat clustering: This is a good place to con-
rRNA) and by the whole mitochondrial genome. trast the informativeness of hierarchical clustering with multidimen-
Experimental evidence: To test the Eutherian orders simultane- sional-scaling based clustering using the same NCD matrix, exhibited
ously with the Marsupionta-versus Theria hypothesis, we added four in Fig. 9. The entries give a good example of typical NCD values; we
animals to the above 20: Australian echidna (Tachyglossus aculeatus), truncated the number of decimals from 15 to 3 significant digits to save
brown bear (Ursus arctos), polar bear (Ursus maritimus), using the space. Note that the majority of distances bunch in the range [0:9; 1].
common carp (Cyprinus carpio) as the outgroup. Interestingly, while This is due to the regularities the compressor can perceive. The diag-
onal elements give the self-distance, which, for PPMZ, is not actually
there are many species of Eutheria and Metatheria, there are only three
0, but is off from 0 only in the third decimal. In Fig. 8, we clustered
species of now living Prototheria known: platypus, and two types of
the 24 animals using the NCD matrix by multidimensional scaling as
echidna (or spiny anteater). So our sample of the Prototheria is large. points in two-dimensional Euclidean space. In this method, the NCD
The addition of the new species might be risky in that the addition of matrix of 24 animals can be viewed as a set of distances between points
new relations is known to distort the previous phylogeny in traditional in n-dimensional Euclidean space (n 24), which we want to project
computational genomics practice. With our method, using the full into a two-dimensional Euclidean space, trying to distort the distances
genome and obtaining a single tree with a very high confidence S (T ) between the pairs as little as possible. This is akin to the problem of
value, that risk is not as great as in traditional methods obtaining projecting the surface of the earth globe on a flat two-dimensional map
ambiguous trees with bootstrap (statistic support) values on the edges. with minimal distance distortion. The main feature is the choice of the
The mitochondrial genomes of the total of 24 species we used were measure of distortion to be minimized, [16]. Let the original set of
downloaded from the GenBank database on the World-Wide Web. distances be d1 ; . . . ; dk and the projected distances be d01 ; . . . ; dk0 . In
Each is around 17 000 bases. The NCD distance matrix was computed Fig. 8, we used the distortion measure Kruskall’s stress-1, [24], which
using the compressor PPMZ. The resulting phylogeny, with an almost minimizes
maximal S (T ) score of 0:996 supports anew the currently accepted
grouping (Rodents, (Primates, Ferungulates)) of the Eutherian or- ( (d i 0 0 )2 )
di =
2
di :
ders, and additionally the Marsupionta hypothesis ((Prototheria, i k k
i
Metatheria), Eutheria), see Fig. 7. Overall, our whole-mitochondrial
NCD analysis supports the following hypothesis: Kruskall’s stress-1 equal 0 means no distortion, and the worst value is
Mammalia at most 1 (unless you have a really bad projection). In the projection of
( (primates, ferungulates)(rodents; (Metatheria, Prototheria))) the NCD matrix according to our quartet method, one minimizes the
more subtle distortion S (T ) measure, where 1 means perfect represen-
Eutheria tation of the relative relations between every 4-tuple, and 0 means min-
which indicates that the rodents, and the branch leading to the imal representation. Therefore, we should compare distortion Kruskall
Metatheria and Prototheria, split off early from the branch that led stress-1 with 1 0 S (T ). Fig. 7 has a very good 1 0 S (T ) = 0:04
to the primates and ferungulates. Inspection of the distance matrix and Fig. 8 has a poor Kruskal stress 0:389. Assuming that the compar-
shows that the primates are very close together, as are the rodents, the ison is significant for small values (close to perfect projection), we find
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 51, NO. 4, APRIL 2005 1537
Fig. 8. Multidimensional clustering of same NCD matrix (Fig. 9) as used for Fig. 7. Kruskal’s stress-1 = 0:389.
Fig. 9. Distance matrix of pairwise NCD. For display purpose, we have truncated the original entries from 15- to 3-decimal precision.
that the multidimensional scaling of this experiment’s NCD matrix is from The Universal Virus Database of the International Committee
formally inferior to that of the quartet tree. This conclusion formally on Taxonomy of Viruses, available on the World-Wide Web. The
justifies the impression conveyed by the figures on visual inspection. SARS virus was downloaded from Canada’s Michael Smith Genome
2) SARS Virus: In another experiment, we clustered the SARS Sciences Centre which had the first public SARS Coronavirus draft
virus after its sequenced genome was made publicly available, in rela- whole genome assembly available for download (SARS TOR2 draft
tion to potential similar virii. The 15 virus genomes were downloaded genome assembly 120403). The NCD distance matrix was com-
1538 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 51, NO. 4, APRIL 2005
Fig. 10. SARS virus among other virii. Legend: AvianAdeno1CELO.inp: Fowl adenovirus 1; AvianIB1.inp: Avian infectious bronchitis virus (strain
Beaudette US); AvianIB2.inp: Avian infectious bronchitis virus (strain Beaudette CK); BovineAdeno3.inp: Bovine adenovirus 3; DuckAdeno1.inp: Duck
adenovirus 1; HumanAdeno40.inp: Human adenovirus type 40; HumanCorona1.inp: Human coronavirus 229E; MeaslesMora.inp: Measles virus strain Moraten;
MeaslesSch.inp: Measles virus strain Schwarz; MurineHep11.inp: Murine hepatitis virus strain ML-11; MurineHep2.inp: Murine hepatitis virus strain 2;
PRD1.inp: Enterobacteria phage PRD1; RatSialCorona.inp: Rat sialodacryoadenitis coronavirus; SARS.inp: SARS TOR2v120403; SIRV1.inp: Sulfolobus virus
SIRV-1; SIRV2.inp: Sulfolobus virus SIRV-2. S (T ) = 0:988.
Fig. 11. Dendrogram of mitochondrial genomes of fungi using NCD. This represents the distance matrix precisely with S (T ) = 0:999.
puted using the compressor bzip2. The relations in Fig. 10 are very (all yeasts), and two filamentous ascomycetes Hypocrea jecorina and
similar to the definitive tree based on medical-macrobio-genomics Verticillium lecanii. The NCD distance matrix was computed using the
analysis, appearing later in the New England Journal of Medicine, compressor PPMZ. The resulting tree is depicted in Fig. 11. The Inter-
[25]. We depicted the figure in the ternary tree style, rather than the pretation of the fungi researchers is “the tree clearly clustered the as-
genomics-dendrogram style, since the former is more precise for comycetous yeasts versus the two filamentous ascomycetes, thus sup-
visual inspection of proximity relations. porting the current hypothesis on their classification (for example, see
3) Analysis of Mitochondrial Genomes of Fungi: As a pilot for ap- [26]). Interestingly, in a recent treatment of the Saccharomycetaceae,
plications of the CompLearn Toolkit in fungi genomics reasearch, the S. servazii, S. castellii, and C. glabrata were all proposed to belong
group of T. Boekhout, E. Kuramae, V. Robert, of the Fungal Biodiver- to genera different from Saccharomyces, and this is supported by the
sity Center, Royal Netherlands Academy of Sciences, supplied us with topology of our tree as well ([27]).”
the mitochondrial genomes of Candida glabrata, Pichia canadensis, To compare the veracity of the NCD clustering with a more fea-
Saccharomyces cerevisiae, S. castellii, S. servazzii, Yarrowia lipolytica ture-based clustering, we also calculated the pairwise distances as fol-
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 51, NO. 4, APRIL 2005 1539
Fig. 12. Dendrogram of mitochondrial genomes of fungi using block frequencies. This represents the distance matrix precisely with S (T ) = 0:999.
Fig. 14. Clustering of Russian writers. Legend: I.S. Turgenev, 1818–1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky
1821–1881 [Crime and Punishment, The Gambler, The Idiot; Poor Folk]; L.N. Tolstoy 1828–1910 [Anna Karenina, The Cossacks, Youth, War and Piece]; N.V.
Gogol 1809–1852 [Dead Souls, Taras Bulba, The Mysterious Portrait, How the Two Ivans Quarrelled]; M. Bulgakov 1891–1940 [The Master and Margarita,
The Fatefull Eggs, The Heart of a Dog]. S (T ) = 0:949.
Fig. 15. Clustering of Russian writers translated in English. The translator is given in brackets after the titles of the texts. Legend: I.S. Turgenev, 1818–1883 [Father
and Sons (R. Hare), Rudin (Garnett, C. Black), On the Eve (Garnett, C. Black), A House of Gentlefolk (Garnett, C. Black)]; F. Dostoyevsky 1821–1881 [Crime and
Punishment (Garnett, C. Black), The Gambler (C.J. Hogarth), The Idiot (E. Martin); Poor Folk (C.J. Hogarth)]; L.N. Tolstoy 1828–1910 [Anna Karenina (Garnett,
C. Black), The Cossacks (L. and M. Aylmer), Youth (C.J. Hogarth), War and Piece (L. and M. Aylmer)]; N.V. Gogol 1809–1852 [Dead Souls (C.J. Hogarth), Taras
Bulba ( G. Tolstoy, 1860, B.C. Baskerville), The Mysterious Portrait and How the Two Ivans Quarrelled ( I.F. Hapgood]; M. Bulgakov 1891–1940 [The Master
and Margarita (R. Pevear, L. Volokhonsky), The Fatefull Eggs (K. Gook-Horujy), The Heart of a Dog (M. Glenny)]. S (T ) = 0:953.
files was run through a preprocessor to extract just MIDI Note-On and a time step and the end of a track. This file is then used as input to the
Note-Off events. These events were then converted to a player-piano compression stage for distance matrix calculation and subsequent tree
style representation, with time quantized in 0.05-s intervals. All search. To check whether any important feature of the music was lost
instrument indicators, MIDI control signals, and tempo variations during preprocessing, we played it back from the preprocessed files to
were ignored. For each track in the MIDI file, we calculate two compare it with the original. To the authors the pieces sounded almost
quantities: An average volume and a modal note. The average volume unchanged. The compressor used to compute the NCD matrix of the
is calculated by averaging the volume (MIDI note velocity) of all notes genres tree, Fig. 16, and that of 12-piece music set, Fig. 17 is bzip2.
in the track. The modal note is defined to be the note pitch that sounds For the full range of the music experiments see [10].
most often in that track. If this is not unique, then the lowest such note Before testing whether our program can see the distinctions between
is chosen. The modal note is used as a key-invariant reference point various classical composers, we first show that it can distinguish be-
from which to represent all notes. It is denoted by 0, higher notes are tween three broader musical genres: classical music, rock, and jazz.
denoted by positive numbers, and lower notes are denoted by negative This may be easier than making distinctions “within” classical music.
numbers. A value of 1 indicates a half-step above the modal note, All musical pieces we used are listed in the tables in [10]. For the genre-
and a value of 02 indicates a whole-step below the modal note. The experiment we used, 12 classical pieces consisting of Bach, Chopin,
tracks are sorted according to decreasing average volume, and then and Debussy, 12 jazz pieces, and 12 rock pieces. The tree (Fig. 16)
output in succession. For each track, we iterate through each time that our program came up with has S (T ) = 0:858. The discrimi-
sample in order, outputting a single signed 8-b value for each currently nation between the three genres is reasonable but not perfect. Since
sounding note. Two special values are reserved to represent the end of S (T ) = 0:858, a fairly low value, the resulting tree does not represent
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 51, NO. 4, APRIL 2005 1541
Fig. 16. Output for the 36 pieces from three music genres. Legend: 12 Jazz: John Coltrane [“Blue Trane,” “Giant Steps,” “Lazy Bird,” “Impressions”]; Miles
Davis [“Milestones,” “Seven Steps to Heaven,” “Solar,” “So What”]; George Gershwin [“Summertime”]; Dizzy Gillespie [“Night in Tunisia”]; Thelonious Monk
[“Round Midnight”]; Charlie Parker [“Yardbird Suite”]; 12 Rock and Pop: The Beatles [“Eleanor Rigby,” “Michelle”]; Eric Clapton [“Cocaine,” “Layla”]; Dire
Straits [“Money for Nothing”]; Led Zeppelin [“Stairway to Heaven”]; Metallica [“One”]; Jimi Hendrix [“Hey Joe,” ”Voodoo Chile”]; The Police [“Every Breath
You Take,” “Message in a Bottle”] Rush [“Yyz”]; 12 Classic: see caption for Fig. 17. S (T ) = 0:858.
the NCD distance matrix very well. Presumably, the information in the a reason that at this point is hidden from us. In fact, running this ex-
NCD distance matrix cannot be represented by a dendrogram of high periment with different compressors and termination conditions con-
S (T ) score. This appears to be a common problem with large (> 25 sistently displayed this anomaly. The small set encompasses the four
or so) natural data sets. Another reason may be that the program termi- movements from Debussy’s “Suite Bergamasque,” four movements of
nated, while trapped in a local optimum. We repeated the experiment book 2 of Bach’s “Wohltemperierte Klavier,” and four preludes from
many times with almost the same results, so that does not appear to Chopin’s “Opus 28.” As one can see in Fig. 17, our program does
be the case. The 11-item subtree rooted at n4 contains 10 of the 12 a pretty good job at clustering these pieces. The S (T ) score is also
jazz pieces, together with a piece of Bach’s “Wohltemporierte Klavier high: 0:968. The four Debussy movements form one cluster, as do the
(WTK).” The other two jazz pieces, Miles Davis’ “So What,” and John four Bach pieces. The only imperfection in the tree, judged by what
Coltrane’s “Giant Steps” are placed elsewhere in the tree, perhaps ac- one would intuitively expect, is that Chopin’s Prélude no. 15 lies a bit
cording to some kinship that now escapes us (but may be identified closer to Bach than to the other three Chopin pieces. This Prélude no
by closer studying of the objects concerned). Of the 12 rock pieces, 15, in fact, consistently forms an odd-one-out in our other experiments
10 are placed in the 12-item subtree rooted at n29, together with a as well. This is an example of pure data mining, since there is some
piece of Bach’s “WTK,” and Coltrane’s “Giant Steps,” while Hendrix’s musical truth to this, as no. 15 is perceived as by far the most eccentric
“Voodoo Chile” and Rush “Yyz” is further away. Of the 12 classical among the 24 Préludes of Chopin’s opus 28.
pieces, 10 are in the 13-item subtrees rooted at the branch n8, n13, n6,
n2, n7, together with Hendrix’s “Voodoo Chile,” Rush’s “Yyz,” and
E. Optical Character Recognition
Miles Davis’ “So What.” Surprisingly, two of the four Bach “WTK”
pieces are placed elsewhere. Yet we perceive the four Bach pieces to Can we also cluster two-dimensional images? Because our method
be very close, both structurally and melodically (as they all come from appears focused on strings this is not straightforward. It turns out that
the monothematic “Wohltemporierte Klavier”). But the program finds scanning a picture in raster row-major order retains enough regularityin
1542 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 51, NO. 4, APRIL 2005
Fig. 17. Output for the 12-piece set. Legend: J.S. Bach [“Wohltemperierte Klavier II: Preludes and Fugues 1,2”— BachWTK2{F,P}{1,2}]; Chopin [“Préludes
op. 28: 1, 15, 22, 24” —ChopPrel{1,15,22,24}]; Debussy [“Suite Bergamasque,” four movements—DebusBerg{1,2,3,4}]. S (T ) = 0:968.
lines are added at the end of each line. Each character is 128 2 128
pixels. The NCD matrix was computed using the compressor PPMZ.
Fig. 19 shows the clusters obtained. There are 10 of each digit “4,” “5,”
“6,” making a total of 30 items in this experiment. All but one of the
4’s are put in the subtree rooted at n1, all but one of the 5’s are put
in the subtree rooted at n4, and all 6’s are put in the subtree rooted at
n3. The remaining 4 and 5 are in the branch n23, n13 joining n6 and
n3. So 28 items out of 30 are clustered correctly, that is, 93%. In this
experiment, we used only three digits. Using the full set of decimal
digits means that too many objects are involved, resulting in a lower
clustering accuracy. However, we can use the NCD as an oblivious fea-
ture-extraction technique to convert generic objects into finite-dimen-
sional vectors. We have used this technique to train a support vector
machine (SVM) based OCR system to classify handwritten digits by
extracting 80 distinct, ordered NCD features from each input image.
In this initial stage of ongoing research, by our oblivious method of
computing the NCDs to use in the SVM classifier, we achieved a hand-
written single decimal digit recognition accuracy of 87%. The current
state-of-the-art for this problem, after half a century of interactive fea-
ture-driven classification research, is in the upper 90% level [34], [40].
All experiments are benchmarked on the standard NIST Special Data
Fig. 18. Images of handwritten digits used for optical character recognition Base 19. Using the NCD for general classification by compression is
(OCR}.
the subject of a future paper.
both dimensions for the compressor to grasp. A simple task along these F. Astronomy
lines is to cluster handwritten characters. The handwritten characters in As a proof of principle, we clustered data from unknown objects, for
Fig. 18 were downloaded from the NIST Special Data Base 19 (optical example objects that are far away. In [3], observations of the micro-
character recognition (OCR) database) on the World-Wide Web (cur- quasar GRS 1915 + 105 made with the Rossi X-ray Timing Eplorer
rently only available for purchase, at a recent check). Each file in the were analyzed. The interest in this microquasar stems from the fact
data directory contains one digit image, either a four, five, or six. Each that it was the first galactic object to show a certain behavior (super-
pixel is a single character: “#” for a black pixel, “1” for white. New luminal expansion in radio observations). Photonometric observation
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 51, NO. 4, APRIL 2005 1543
data from X-ray telescopes were divided into short time segments (usu- observations are essentially time series, and our aim was experimenting
ally in the order of 1 min), and these segments have been classified with our method as a pilot to more extensive joint research. Here, the
into a bewildering array of 15 different modes after considerable ef- task was to see whether the clustering would agree with the classifi-
fort. Briefly, spectrum hardness ratios (roughly, “color”) and photon cation above. The NCD matrix was computed using the compressor
count sequences were used to classify a given interval into categories PPMZ. The results are in Fig. 20. We clustered 12 objects, consisting
of variability modes. From this analysis, the extremely complex vari- of three intervals from four different categories denoted as ,
, , in
ability of this source was reduced to transitions between three basic [3, Table I]. In Fig. 20, we denote the categories by the corresponding
states, which, interpreted in astronomical terms, gives rise to an expla- Roman letters D, G, P, and T, respectively. The resulting tree groups
nation of this peculiar source in standard black-hole theory. The data these different modes together in a way that is consistent with the clas-
we used in this experiment were made available to us by M. Klein Wolt sification by experts for these observations. The oblivious compression
(coauthor of [3]) and T. Maccarone, both researchers at the Astronom- clustering corresponds precisely to the laborious feature-driven classi-
ical Institute “Anton Pannekoek” at the University of Amsterdam. The fication in [3]. Further work on clustering of (possibly heterogenous)
1544 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 51, NO. 4, APRIL 2005
Fig. 20. Sixteen observation intervals of GRS 1915+105 from four classes. The initial capital letter indicates the class corresponding to Greek lower case letters
in [3]. The remaining letters and digits identify the particular observation interval in terms of finer features and identity. The T -cluster is top left, the P -cluster is
bottom left, the G-cluster is to the right, and the D -cluster in the middle. This tree almost exactly represents the underlying NCD distance matrix: S (T ) = 0:994.
time series and anomaly detection, using the new compression method, ACKNOWLEDGMENT
was recently done on a massive scale in [20].
The authors wish to thank Loredana Afanasiev, Graduate School of
Logic, University of Amsterdam; Teun Boekhout, Eiko Kuramae, Vin-
IX. CONCLUSION cent Robert, Fungal Biodiversity Center, Royal Netherlands Academy
To interpret what the NCD is doing, and to explain its remarkable of Sciences; Marc Klein Wolt, Thomas Maccarone, Astronomical Insti-
accuracy and robustness across application fields and compressors, the tute “Anton Pannekoek,” University of Amsterdam; Evgeny Verbitskiy,
intuition is that the NCD minorizes all similarity distances based on Philips Research; Steven de Rooij, Ronald de Wolf, CWI; the referees
features that are captured by the reference compressor involved. Such and the editors, for suggestions, comments, help with experiments, and
features must be relatively simple in the sense that they are expressed data; Jorma Rissanen and Boris Ryabko for discussions, John Lang-
by an aspect that the compressor analyzes (for example, frequencies, ford for suggestions, Tzu-Kuo Huang for pointing out some typos and
matches, repeats). Certain sophisticated features may well be express- simplifications, and Teemu Roos and Henry Tirri for implementing a
ible as combinations of such simple features, and are therefore them- visualization of the clustering process.
selves simple features in this sense. The extensive experimenting above
shows that even elusive features are captured.
A potential application of our nonfeature (or rather, many-unknown- REFERENCES
feature) approach is exploratory. Presented with data for which the fea- [1] D. Benedetto, E. Caglioti, and V. Loreto, “Language trees and zipping,”
tures are as yet unknown, certain dominant features governing simi- Phys. Rev. Lett., vol. 88, no. 4, pp. 048702-1–048702-4, 2002.
larity are automatically discovered by the NCD. Examining the data un- [2] P. Ball, “Algorithm makes tongue tree,” Nature, Science update, Jan. 22,
derlying the clusters may yield this hitherto unknown dominant feature. 2002.
[3] T. Belloni, M. Klein-Wolt, M. Méndez, M. van der Klis, and J. van
Our experiments indicate that the NCD has application in two new Paradijs, “A model-independent analysis of the variability of GRS
areas of support vector machine (SVM) based learning. First, we find 1915 + 105,” Astron. Astrophys., vol. 355, pp. 271–290, 2000.
that the complemented NCD (10NCD) is useful as a kernel for generic [4] C. H. Bennett, P. Gács, M. Li, P. M. B. Vitányi, and W. Zurek, “Informa-
objects in SVM learning. Secondly, we can use the normal NCD as a tion distance,” IEEE Trans. Inf. Theory, vol. 44, no. 4, pp. 1407–1423,
feature-extraction technique to convert generic objects into finite-di- Jul. 1998.
mensional vectors, see the last paragraph of Section VIII-E. In effect, [5] C. H. Bennett, M. Li, and B. Ma, “Chain letters and evolutionary histo-
ries,” Scient. Amer., pp. 76–81, Jun. 2003.
our similarity engine aims at the ideal of a perfect data mining process,
[6] D. Bryant, V. Berry, P. Kearney, M. Li, T. Jiang, T. Wareham, and H.
discovering unknown features in which the data can be similar. This Zhang, “A practical algorithm for recovering the best supported edges
is the subject of ongoing joint research in genomics of fungi, clinical of an evolutionary tree,” in Proc. 11th ACM-SIAM Symp. Discrete Algo-
molecular genetics, and radio astronomy. rithms, San Francisco, CA, Jan. 9–11, 2000, pp. 287–296.
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 51, NO. 4, APRIL 2005 1545
[7] Y. Cao, A. Janke, P. J. Waddell, M. Westerman, O. Takenaka, S. Murata, [30] M. Li and P. M. B. Vitányi, “Algorithmic complexity,” in International
N. Okada, S. Pääbo, and M. Hasegawa, “Conflict among individual mi- Encyclopedia of the Social & Behavioral Sciences, N. J. Smelser and P.
tochondrial proteins in resolving the phylogeny of eutherian orders,” J. B. Baltes, Eds. Oxford, U.K.: Pergamon, 2001/2002, pp. 376–382.
Mol. Evol., vol. 47, pp. 307–322, 1998. [31] M. Li, X. Chen, X. Li, B. Ma, and P. M. B. Vitányi, “The similarity
[8] X. Chen, B. Francia, M. Li, B. McKinnon, and A. Seker, “Shared infor- metric,” IEEE Trans. Inf. Theory, vol. 50, no. 12, pp. 3250–3264, Dec.
mation and program plagiarism detection,” IEEE Trans. Inf. Theory, vol. 2004.
50, no. 7, pp. 1545–1551, Jul. 2004. [32] M. Li and P. M. B. Vitányi, An Introduction to Kolmogorov Complexity
[9] R. Cilibrasi. (2003) The CompLearn Toolkit. [Online]. Available: and Its Applications, 2nd ed. New York: Springer-Verlag, 1997.
http://complearn.sourceforge.net/ [33] A. Londei, V. Loreto, and M. O. Belardinelli, “Music style and author-
[10] R. Cilibrasi, P. M. B. Vitányi, and R. de Wolf, “Algorithmic clustering ship categorization by informative compressors,” in Proc. 5th Triannual
of music based on string compression,” Comp. Music J., vol. 28, no. 4, Conf. European Society for the Cognitive Sciences of Music (ESCOM),
pp. 49–67, 2004. Hannover, Germany, Sep. 8–13, 2003, pp. 200–203.
[11] G. Cormode, M. Paterson, S. Sahinalp, and U. Vishkin, “Communication [34] L. S. Oliveira, R. Sabourin, F. Bortolozzi, and C. Y. Suen, “Automatic
complexity of document exchange,” in Proc. 11th ACM-SIAM Symp. recognition of handwritten numerical strings: A recognition and verifi-
Discrete Algorithms, San Francisco, CA, Jan. 9–11, 2000, pp. 197–206. cation strategy,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 11,
[12] T. M. Cover and J. A. Thomas, Elements of Information Theory. New pp. 1438–1454, Nov. 2002.
York: Wiley, 1991. [35] United Nations General Assembly Resolution 217 A (III) of 10 De-
[13] W. Chai and B. Vercoe, “Folk music classification using hidden Markov cember 1948. Universal Declaration of Human Rights. [Online]. Avail-
model,” in Proc. Int. Conf. Artificial Intelligence, Las Vegas, NV, Jun. able: http://www.un.org/Overview/rights.html
2001. [36] A. Rokas, B. L. Williams, N. King, and S. B. Carroll, “Genome-scale
[14] M. Cooper and J. Foote, “Automatic music summarization via similarity approaches to resolving incongruence in molecular phylogenies,” Na-
analysis,” in Proc. IRCAM, Paris, France, 2002, pp. 81–85. ture, vol. 425, pp. 798–804, Oct. 2003.
[15] R. Dannenberg, B. Thom, and D. Watson, “A machine learning approach [37] D. Salomon, Data Compression. New York: Springer-Verlag, 1997.
to musical style recognition,” in Proc. Int. Computer Music Conf., Thes- [38] N. Saitou and M. Nei, “The neighbor-joining method: A new method for
saloniki, Greece, Sep. 25–30, 1997, pp. 344–347. reconstructing phylogenetic trees,” Mol. Biol. Evol., vol. 4, pp. 406–425,
[16] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. 1987.
New York: Wiley Interscience, 2001. [39] P. Scott. (2001) Music Classification Using Neural Networks. [Online].
[17] M. Grimaldi, A. Kokaram, and P. Cunningham. (2002) Classi- Available: http://www.stanford.edu/class/ee373a/musicclassification.
fying Music by Genre Using the Wavelet Packet Transform and a pdf
Round-Robin Ensemble. Trinity College, Dublin, U.K.. [Online]. Avail- [40] Ø. D. Trier, A. K. Jain, and T. Taxt, “Feature extraction methods for char-
able: http://www.cs.tcd.ie/publications/tech-reports/reports.02/TCD- acter recognition—A survey,” Patt. Recogn., vol. 29, no. 4, pp. 641–662,
CS-2002-64.pdf 1996.
[18] A. Janke, O. Magnell, G. Wieczorek, M. Westerman, and U. Arnason, [41] P. N. Yianilos. (1991) Normalized Forms for Two Common Metrics.
“Phylogenetic analysis of 18S rRNA and the mitochondrial genomes NEC Res. Inst.. [Online]. Available: http://www.pnylab.com/pny/
of wombat, vombatus ursinus, and the spiny anteater, tachyglossus ace- [42] A. C.-C. Yang, C.-K. Peng, H.-W. Yien, and A. L. Goldberger, “Infor-
laetus: Increased support for the marsupionta hypothesis,” J. Mol. Evol., mation categorization approach to literary authorship disputes,” Physica
vol. 54, no. 1, pp. 71–80, 2002. A, vol. 329, pp. 473–483, 2003.
[19] T. Jiang, P. Kearney, and M. Li, “A polynomial time approximation [43] G. Tzanetakis and P. Cook, “Music genre classification of audio signals,”
scheme for inferring evolutionary trees from quartet topologies and its IEEE Trans. Speech and Audio Process., vol. 10, no. 5, pp. 293–302, Jul.
application,” SIAM J. Comput., vol. 30, no. 6, pp. 1942–1961, 2001. 2002.
[20] E. Keogh, S. Lonardi, and C. A. Rtanamahatana, “Toward parameter- [44] S. Wehner. (2004) Analyzing Network Traffic and Worms Using Com-
free data mining,” in Proc. 10th ACM SIGKDD Int. Conf. Knowledge pression. CWI. [Online]. Available: http://homepages.cwi.nl/~wehner/
Discovery and Data Mining, Seattle, WA, Aug. 22–25, 2004, pp. worms/
206–215.
[21] J. K. Killian, T. R. Buckley, N. Steward, B. L. Munday, and R. L. Jirtle,
“Marsupials and eutherians reunited: Genetic evidence for the theria hy-
pothesis of mammalian evolution,” Mammalian Genome, vol. 12, pp.
513–517, 2001.
[22] M. Koppel, S. Argamon, and A. R. Shimoni, “Automatically catago-
rizing written texts by author gender,” Literary and Linguistic Comput.,
vol. 17, no. 4, pp. 401–412, 2002.
[23] A. Kraskov, H. Stögbauer, R. G. Adrsejak, and P. Grassberger. (2003)
Hierarchical Clustering Based on Mutual Information. [Online]. Avail-
able: http://arxiv.org/abs/q-bio/0311039
[24] J. B. Kruskal, “Nonmetric multidimensional scaling: A numerical
method,” Psychometrika, vol. 29, pp. 115–129, 1964.
[25] T. G. Ksiazek et al., “A novel coronavirus associated with severe acute
respiratory syndrome,” New England J. Med., vol. 349, p. 709, Aug. 14,
2003.
[26] C. P. Kurtzman and J. Sugiyama, “Ascomycetous yeasts and yeast-like
taxa,” in The Mycota VII, Systemtics and Evolution. Berlin, Germany:
Springer-Verlag, 2001, pt. A, pp. 179–200.
[27] C. P. Kurtzman, “Phylogenetic circumscription of saccharomyces,
kluyveromyces and other members of the saccharomycetaceaea, and
the proposal of the new genera lachnacea, nakaseomyces, naumovia,
vanderwaltozyma and zygotorulaspora,” FEMS Yeast Res., vol. 4, pp.
233–245, 2003.
[28] P. S. Laplace, A Philosophical Essay on Probabilities. New York:
Dover, 1951. English translation of the 1918 French original version.
[29] M. Li, J. H. Badger, X. Chen, S. Kwong, P. Kearney, and H. Zhang,
“An information-based sequence distance and its application to whole
mitochondrial genome phylogeny,” Bioinformatics, vol. 17, no. 2, pp.
149–154, 2001.