Clustering by Compression: Abstract-We Present A New Method For Clustering Based On Compression

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 51, NO.
4, APRIL 2005 1523
Clustering by Compression its universality and robustness by experiments and results in a plethora
of different areas using different types of compressors.
Rudi Cilibrasi and Paul M. B. Vitányi Feature-Based Similarities: We are presented with unknown data
and the question is to determine the similarities among them and group
Abstract—We present a new method for clustering based on compression. like with like together. Commonly, the data are of a certain type: music
The method does not use subject-specific features or background knowl- files, transaction records of ATM machines, credit card applications,
edge, and works as follows: First, we determine a parameter-free, universal, genomic data. In these data, there are hidden relations that we would
similarity distance, the normalized compression distance or NCD, com- like to get out into the open. For example, from genomic data one can
puted from the lengths of compressed data files (singly and in pairwise con- extract letter or block frequencies (the blocks are over the four-letter
catenation). Second, we apply a hierarchical clustering method. The NCD
is not restricted to a specific application area, and works across applica- alphabet); from music files one can extract various specific numerical
tion area boundaries. A theoretical precursor, the normalized information features, related to pitch, rhythm, harmony, etc. One can extract such
distance, co-developed by one of the authors, is provably optimal. How- features using, for instance, Fourier transforms [43] or wavelet trans-
ever, the optimality comes at the price of using the noncomputable notion forms [17], to quantify parameters expressing similarity. The resulting
of Kolmogorov complexity. We propose axioms to capture the real-world
setting, and show that the NCD approximates optimality. To extract a hi-
vectors corresponding to the various files are then classified or clus-
erarchy of clusters from the distance matrix, we determine a dendrogram tered using existing classification software, based on various standard
(ternary tree) by a new quartet method and a fast heuristic to implement it. statistical pattern recognition classifiers [43], Bayesian classifiers [15],
The method is implemented and available as public software, and is robust hidden Markov models [13], ensembles of nearest neighbor classifiers
under choice of different compressors. To substantiate our claims of univer- [17], or neural networks [15], [39]. For example, in music, one fea-
sality and robustness, we report evidence of successful application in areas
as diverse as genomics, virology, languages, literature, music, handwritten ture would be to look for rhythm in the sense of beats per minute. One
digits, astronomy, and combinations of objects from completely different can make a histogram where each histogram bin corresponds to a par-
domains, using statistical, dictionary, and block sorting compressors. In ticular tempo in beats-per-minute and the associated peak shows how
genomics, we presented new evidence for major questions in Mammalian frequent and strong that particular periodicity was over the entire piece.
evolution, based on whole-mitochondrial genomic analysis: the Eutherian
orders and the Marsupionta hypothesis against the Theria hypothesis.
In [43], we see a gradual change from a few high peaks to many low
and spread-out ones going from hip-hip, rock, jazz, to classical. One
Index Terms—Heterogenous data analysis, hierarchical unsupervised can use this similarity type to try to cluster pieces in these categories.
clustering, Kolmogorov complexity, normalized compression distance,
parameter-free data mining, quartet tree method, universal dissimilarity
However, such a method requires specific and detailed knowledge of
distance. the problem area, since one needs to know what features to look for.
Non-Feature Similarities: Our aim is to capture, in a single simi-
larity metric, every effective distance: effective versions of Hamming
I. INTRODUCTION distance, Euclidean distance, edit distances, alignment distance,
All data are created equal but some data are more alike than others. Lempel–Ziv distance [11], and so on. This metric should be so general
We propose a method expressing this alikeness, using a new similarity that it works in every domain: music, text, literature, programs,
metric based on compression. It is parameter-free in that it does not genomes, executables, natural language determination, equally and si-
use any features or background knowledge about the data, and can multaneously. It would be able to simultaneously detect all similarities
without changes be applied to different areas and across area bound- between pieces that other effective distances can detect seperately.
aries. It is universal in that it approximates the parameter expressing Compression-Based Similarity: Such a “universal” metric was
similarity of the dominant feature in all pairwise comparisons. It is ro- co-developed by us in [29]–[31], as a normalized version of the “infor-
bust in the sense that its success appears independent from the type mation metric” of [4], [32]. Roughly speaking, two objects are deemed
of compressor used. The clustering we use is hierarchical clustering close if we can significantly “compress” one given the information in
in dendrograms based on a new fast heuristic for the quartet method. the other, the idea being that if two pieces are more similar, then we
The method is available as an open-source software tool. Below we ex- can more succinctly describe one given the other. The mathematics
plain the method, the theory underpinning it, and present evidence for used is based on Kolmogorov complexity theory [32]. In [31], we
defined a new class of (possibly nonmetric) distances, taking values
in [0; 1] and appropriate for measuring effective similarity relations
Manuscript received March 31, 2003; revised December 29, 2004. The mate- between sequences, say one type of similarity per distance, and vice
rial in this correspondence was presented in part at the IEEE International Sym-
posium on Information Theory, Yokohama, Japan, June/July 2003. The work of versa. It was shown that an appropriately “normalized” information
R. Cilibrasi was supported in part by The Netherlands BSIK/BRICKS Project distance minorizes every distance in the class. It discovers all effective
and by the NWO Project 612.55.002. The work of P. M. B. Vitányi was sup- similarities in the sense that if two objects are close according to some
ported in part by the EU Project RESQ, IST-2001-37559, the NoE QUIPRO- effective similarity, then they are also close according to the normal-
CONE IST-1999-29064, the ESF QiT Programmme, and the EU NoE PASCAL,
The Netherlands BSIK/BRICKS Project, and the KRR and SML&KA Programs
ized information distance. Put differently, the normalized information
of National ICT of Australia. Part of his work was performed while P. M. B. distance represents similarity according to the dominating shared
Vitányi was on sabbatical leave at the National ICT of Australia, Sydney Labo- feature between the two objects being compared. In comparisons of
ratory at the University of New South Wales. more than two objects, different pairs may have different dominating
R. Cilibrasi is with the Centre for Mathematics and Computer Science (CWI),
features. The normalized information distance is a metric and takes
P. O. Box 94079, NL-1090 GB Amsterdam, The Netherlands (e-mail: Rudi.Cili-
[email protected]). values in [0; 1]; hence, it may be called the similarity metric. To
P. M. B. Vitányi is with the Centre for Mathematics and Computer Sci- apply this ideal precise mathematical theory in real life, we have to
ence (CWI), P. O. Box 94079, NL-1090 GB Amsterdam, The Netherlands replace the use of the noncomputable Kolmogorov complexity by
(e-mail:[email protected]). He is also affiliated with the University of an approximation using a standard real-world compressor. Earlier
Amsterdam and National ICT of Australia.
Communicated by J. C. Kieffer, W. Szpankowski, and E.-h. Yang, Guest Ed- approaches resulted in the first completely automatic construction of
itors for the Special Issue on Problems on Sequences. the phylogeny tree based on whole mitochondrial genomes, [29]–[31],
Digital Object Identifier 10.1109/TIT.2005.844059 a completely automatic construction of a language tree for over 50
0018-9448/$20.00 © 2005 IEEE

1524 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 51, NO. 4, APRIL 2005
Euro-Asian languages [31], detects plagiarism in student programming well the resulting tree represents the information in the distance ma-
assignments [8], gives phylogeny of chain letters [5], and clusters trix on a scale of 0 to 1. Then, as proof of principle, we run the pro-
music [10]. Moreover, the method turns out to be robust under change gram on three data sets, where we know what the final answer should
of the underlying compressor-types: statistical (PPMZ), Lempel–Ziv be: i) reconstruct a tree from a distance matrix obtained from a ran-
based dictionary (gzip), block based (bzip2), or special purpose domly generated tree; ii) reconstruct a tree from files containing arti-
(Gencompress). ficial similarities; and iii) reconstruct a tree from natural files of het-
Related Work: In view of the simplicity and naturalness of our erogenous data of vastly different types. To substantiate our claim of
proposal, it is perhaps surprising that compression-based clustering parameter-freeness and universality, we apply the method to different
and classification approaches did not arise before. But recently there areas, not using any feature analysis at all. We first give an example
have been several partially independent proposals in that direction: [1], in whole-genome phylogeny using the whole mitochondrial DNA of
[2] for author attribution and building language trees—while citing the the species concerned. We compare the hierarchical clustering of our
earlier work [4], [32]—does not develop a theory based on information method with a more standard method of two-dimensional clustering (to
distance but proceeds by more ad hoc arguments related to the com- show that our dendrogram method of depicting the clusters is more in-
pressibility of a target file after first compressing a reference file. The formative). We give a whole-genome phylogeny of fungi and compare
better the target file compresses, the more we feel it is similar to the this to results using alignment of selected proteins (alignment being
reference file in question. See also the explanation in [31, Appendix I]. often too costly to perform on the whole-mitochondial genome, but the
This approach is used also to cluster music MIDI (Musical Instrument disadvantage of protein selection being that different selections usually
Digital Interface, a versatile digital music format available on the result in different phylogenies—so which is right?). We identify the
World-Wide Web) files by Kohonen maps in [33]. Another recent virii that are closest to the sequenced SARS virus; we give an example
offshoot based on our work is hierarchical clustering based on mutual of clustering of language families; Russian authors in the original Rus-
information, [23]. In a related, but considerably simpler feature-based sian, the same pieces in English translation (clustering partially follows
approach, one can compare the word frequencies in text files to assess the translators); clustering of music in MIDI format; clustering of hand-
similarity. In [42], the word frequencies of words common to a pair of written digits used for optical character recognition; and clustering of
text files are used as entries in two vectors, and the similarity of the radio observations of a mysterious astronomical object, a microquasar
two files is based on the distance between those vectors. The authors of extremely complex variability. In all these cases, the method per-
attribute authorship to Shakespeare plays, the Federalist Papers, and forms very well in the following sense: The method yields the phy-
the Chinese classic “The Dream of the Red Chamber.” This approach logeny of 24 species agreeing with biological wisdom insofar as it is un-
based on block occurrence statistics is standard in genomics, but in an controversial. The probability that it randomly would hit this outcome,
experiment reported in [31] gives inferior phylogeny trees compared or anything reasonably close, is very small. In clustering 36 music
to our compression method (and wrong ones according to current pieces taken equally from pop, jazz, and classic, so that 12–12–12 is
biological wisdom). A related, opposite, approach was taken in [22], the grouping we understand is correct, we can identify convex clusters
where literary texts are clustered by author gender or fact versus so that only six errors are made. (That is, if three items get dislodged
fiction, essentially by first identifying distinguishing features, like without two of them being interchanged, then six items get misplaced.)
gender-dependent word usage, and then classifying according to those The probability that this happens by chance is extremely small. The
features. Apart from the experiments reported here, the clustering reason why we think the method does something remarkable is con-
by compression method reported in this correspondence has recently cisely put by Laplace [28]:
been used to analyze network traffic and cluster computer worms and
“If we seek a cause wherever we perceive symmetry, it is not that
virusses [44]. Finally, recent work [20] reports experiments with our
we regard the symmetrical event as less possible than the others,
method on all time sequence data used in all the major data-mining
but, since this event ought to be the effect of a regular cause or
conferences in the last decade. Comparing the compression method
that of chance, the first of these suppositions is more probable
with all major methods used in those conferences they established clear
than the second. On a table we see letters arranged in this order
superiority of the compression method for clustering heterogenous
Constantinople, and we judge that this arrangement is not
data, and for anomaly detection. See also the explanation in [31,
the result of chance, not because it is less possible than others,
Appendix II].
for if this word were not employed in any language we would not
Outline: Here we propose a first comprehensive theory of real-world
suspect it came from any particular cause, but this word being in
compressor-based normalized compression distance, a novel hierar-
use among us, it is incomparably more probable that some person
chical clustering heuristic, together with several applications. First, we
has thus arranged the aforesaid letters than that this arrangement
propose mathematical notions of “admissible distance” (using the term
is due to chance.”
for a wider class than we did in [31]), “normalized admissible distance”
or “similarity distance,” “normal compressor,” and “normalized com- Materials and Methods: The data samples we used were obtained
pression distance.” We then prove the normalized compression distance from standard databases accessible on the World-Wide Web, generated
based on a normal compressor to be a similarity distance satisfying the by ourselves, or obtained from research groups in the field of inves-
metric (in)equalities. The normalized compression distance is shown tigation. We supply the details with each experiment. The method of
to be quasi-universal in the sense that it minorizes every computable processing the data was the same in all experiments. First, we prepro-
similarity distance up to an error that depends on the quality of the cessed the data samples to bring them in appropriate format: the ge-
compressor’s approximation of the true Kolmogorov complexities of nomic material over the four-letter alphabet fA; T ; G; C g is recoded
the files concerned. This means that the NCD captures the dominant in a four-letter alphabet; the music MIDI files are stripped of iden-
similarity over all possible features for every pair of objects compared, tifying information such as composer and name of the music piece.
up to the stated precision. Note that different pairs of objects may have Then, in all cases, the data samples were completely automatically pro-
different dominant shared features. Next, we present a method of hi- cessed by our CompLearn Toolkit, rather than as is usual in phylogeny,
erarchical clustering based on a novel fast randomized hill-climbing by using an ecclectic set of software tools per experiment. Oblivious
heuristic of a new quartet tree optimization criterion. Given a matrix to the problem area concerned, simply using the distances according
of the pairwise similarity distances between the objects, we score how to the NCD below, the method described in this correspondence fully
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 51, NO. 4, APRIL 2005 1525
automatically classifies the objects concerned. The method has been with phylogeny hypotheses, the clustering can give true support of
released into the public domain as open-source software. The Com- one hypothesis against another one.
pLearn Toolkit [9] is a suite of simple utilities that one can use to apply Figures: We use two styles to display the hierarchical clusters. In
compression techniques to the process of discovering and learning pat- the case of genomics of Eutherian orders and fungi, language trees,
terns in completely different domains. In fact, this method is so general it is convenient to follow the dendrograms that are customary in
that it requires no background knowledge about any particular subject that area (suggesting temporal evolution) for easy comparison with
area. There are no domain-specific parameters to set, and only a handful the literature. Although there is no temporal relation intended, the
of general settings. dendrogram representation looked also appropriate for the Russian
The CompLearn Toolkit using NCD and not, say, alignment, can writers, and translations of Russian writers. In the other experiments
cope with full genomes and other large data files and thus comes up (even the genomic SARS experiment) it is more informative to
with a single distance matrix. The clustering heuristic generates a tree display an unrooted ternary tree (or binary tree if we think about
with a certain confidence, called standardized benefit score or S (T ) incoming and outgoing edges) with explicit internal nodes. This
value in the sequel. Generating trees from the same distance matrix facilitates identification of clusters in terms of subtrees rooted at
many times resulted in the same tree in case of high S (T ) value, or internal nodes or contiguous sets of subtrees rooted at branches of
a similar tree in case of moderately high S (T ) value, for all distance internal nodes.
matrices we used, even though the heuristic is randomized. That is,
there is only one way to be right, but increasingly many ways to be
increasingly wrong which can all be realized by different runs of the II. SIMILARITY DISTANCE
randomized algorithm. This is a big difference compared to previous We give a precise formal meaning to the loose distance notion of
phylogeny methods, where because of computational limitations one “degree of similarity” used in the pattern recognition literature.
uses only parts of the genome, or certain proteins that are viewed as
significant [21]. These are run through a tree reconstruction method A. Distance and Metric
like neighbor joining [38], maximum likelihood, maximum evolution,
maximum parsimony as in [21], or quartet hypercleaning [6], many Let
be a nonempty set and R+ be the set of nonnegative real
times. The percentage-wise agreement on certain branches arising are numbers. A distance function on
is a function D :
2
! R+ . It
called “bootstrap values.” Trees are depicted with the best bootstrap is a metric if it satisfies the following metric (in)equalities:
values on the branches that are viewed as supporting the theory tested. • D(x; y ) = 0 iff x = y ,
Different choices of proteins result in different best trees. One way • D(x; y ) = D(y; x) (symmetry), and
to avoid this ambiguity is to use the full genome, [31], [36], leading • D(x; y ) D(x; z ) + D(z; y ) (triangle inequality).
to whole-genome phylogeny. With our method we do whole-genome The value D(x; y ) is called the distance between x; y 2
. A familiar
phylogeny, and end up with a single overall best tree, not optimizing example of a distance that is also metric is the Euclidean metric,
the everyday distance e(a; b) between two geographical objects a, b
selected parts of it.
The quality of the results depends on a) the NCD distance matrix,
expressed in, say, meters. Clearly, this distance satisfies the properties
e(a; a) = 0, e(a; b) = e(b; a), and e(a; b) e(a; c) + e(c; b) (for
and b) on how well the hierarchical tree represents the information in
the matrix. The quality of b) is measured by the S (T ) value, and is
instance, a = Amsterdam, b = Brussels, and c = Chicago). We are
given with each experiment. In general, the S (T ) value deteriorates
interested in a particular type of distance, the “similarity distance,”
for large sets. We believe this to be partially an artifact of a low-
which we formally define in Definition 2.5. For example, if the
resolution NCD matrix due to limited compression power, and limited
objects are classical music pieces then the function D defined by
file size. The main reason, however, is the fact that with increasing
D (a; b) = 0 if a and b are by the same composer and D (a; b) =
size of a natural data set the projection of the information in the
NCD matrix into a binary tree unavoidably gets increasingly distorted. 1 otherwise, is a similarity distance that is also a metric. This
Another aspect limiting the quality of the NCD matrix is more subtle. metric captures only one similarity aspect (feature) of music pieces,
Recall that the method knows nothing about any of the areas we apply presumably an important one that subsumes a conglomerate of more
it to. It determines the dominant feature as seen through the NCD elementary features.
filter. The dominant feature of alikeness between two files may not
correspond to our a priori assumption but may have an unexpected B. Admissible Distance
cause. The results of our experiments suggest that this is not often In defining a class of admissible distances (not necessarily metric
the case: In the natural data sets where we have preconceptions of the distances) we want to exclude unrealistic ones like f (x; y ) = 12 for
outcome, for example, that works by the same authors should cluster every pair x 6= y . We do this by restricting the number of objects within
together, or music pieces by the same composers, musical genres, a given distance of an object. As in [4], we do this by only considering
or genomes, the outcomes conform largely to our expectations. For effective distances, as follows: Fix a suitable, and for the remainder of
example, in the music genre experiment, the method would fail
the paper, fixed, programming language. This is the reference program-
dramatically if genres were evenly mixed, or mixed with little bias.
ming language.
However, to the contrary, the separation in clusters is almost perfect.
The few misplacements that are discernable are either errors (the Definition 2.1: Let
= 63 , with 6 a finite nonempty alphabet
method was not powerful enough to discern the dominant feature), and 63 the set of finite strings over that alphabet. Since every finite
the distortion due to mapping multidimensional distances into tree alphabet can be recoded in binary, we choose 6 = f0; 1g. In partic-
distances, or the dominant feature between a pair of music pieces ular, “files” in computer memory are finite binary strings. A function
is not the genre but some other aspect. The surprising news is D :
2
! R
+ is an admissible distance if for every pair of objects
that we can generally confirm expectations with few misplacements, x; y 2
the distance D (x; y ) is the length of a binary prefix codeword
indeed, that the data do not contain unknown rogue features that that is a program that computes x from y , and vice versa, in the refer-
dominate to cause spurious (in our preconceived idea) clustering. ence programming language. This implies that an admissible distance
This gives evidence that where the preconception is in doubt, like is symmetric.
If D is an admissible distance, then for every x, the set fD(x; y ) : It follows from the definitions that a normalized admissible distance
y 2 f0; 1g3 g is the length set of a prefix code. Hence, it satisfies by is a function d :
2
! [0; 1] that is symmetric: d(x; y ) = d(y; x).
Lemma 2.6: For every x 2
, and constant e 2 [0; 1], a normalized
the Kraft inequality, see [12],
20D(x;y) 1: (II.1)
admissible distance satisfies the density constraint
jfy : d(x; y) egj < 2

y
eD (x)+1
: (II.2)
Example 2.2: In representing the Hamming distance d between
Proof: Assume to the contrary that d does not satisfy (II.2). Then,
two strings of equal length n differing in positions i1 ; . . . ; id ,
there is an e 2 [0; 1], such that (II.2) is false. We first note that, since
we can use a simple prefix-free encoding of (n; d; i1 ; . . . ; id ) in
2 log n + 4 log log n + 2 + d log n bits. We encode n and d prefix-free D(x; y) is an admissible distance that satisfies (II.1), d(x; y) satisfies
in log n + 2 log log n + 1 bits each, see, e.g., [32], and then the
a “normalized” version of the Kraft inequality
literal indexes of the actual flipped-bit positions. Adding an O(1)-bit
program to interpret these data, with the strings concerned being x 20 ( d x;y )D (x)
20 (d x;y )D (x;y )
1: (II.3)
and y , we have defined y y
H (x; y) = 2 log n + 4 log log n + d log n + O(1)

n
Starting from (II.3), we obtain the required contradiction
as the length of a prefix code word (prefix program) to compute x from 1 20 (

d x;y )D (x)
20 eD (x)
y and vice versa. Then, by the Kraft inequality, y y:d(x;y )e

2 eD (x)+1
20 eD (x)
> 1:
20H (x;y )
1:
If d(x; y ) is the
y
normalized version of an admissible distance
It is easy to verify that Hn is a metric in the sense that it satisfies the D(x; y) then (II.3) is equivalent to (II.1). We call a normalized
metric (in)equalities up to O(log n) additive precision. distance a “similarity” distance, because it gives a relative similarity
according to the distance (with distance 0 when objects are maximally
C. Normalized Admissible Distance similar and distance 1 when they are maximally dissimilar) and,
Large objects (in the sense of long strings) that differ by a tiny part conversely, for every well-defined computable notion of similarity
are intuitively closer than tiny objects that differ by the same amount. we can express it as normalized admissible distance according to our
For example, two whole mitochondrial genomes of 18 000 bases that definition. In the literature, a distance that expresses lack of similarity
differ by 9000 are very different, while two whole nuclear genomes of (like ours) is often called a “dissimilarity” distance or a “disparity”
3 2 109 bases that differ by only 9000 bases are very similar. Thus, distance.
absolute difference between two objects does not govern similarity, but Remark 2.7: As far as the authors know, the idea of normalized
relative difference appears to do so. metric is, surprisingly, not well studied. An exception is [41], which
Definition 2.3: A compressor is a lossless encoder mapping
into investigates normalized metrics to account for relative distances rather
f0; 1g3 such that the resulting code is a prefix code. “Lossless” means than absolute ones, and it does so for much the same reasons as in
the present work. An example is the normalized Euclidean metric jx 0
yj=(jxj + jyj), where x; y 2 Rn (R denotes the real numbers) and j1j is
that there is a decompressor that reconstructs the source message
from the code message. For convenience of notation we identify
“compressor” with a “codeword length function” C :
! N , where the Euclidean metric—the L2 norm. Another example is a normalized
N is the set of nonnegative integers. That is, the compressed version symmetric-set-difference metric. But these normalized metrics are not
of a file x has length C (x). We only consider compressors such that necessarily effective in that the distance between two objects gives the
C (x) jxj + O(log jxj). (The additive logarithmic term is due to our length of an effective description to go from either object to the other
requirement that the compressed file be a prefix codeword.) We fix a one.
compressor C , and call the fixed compressor the reference compressor. Remark 2.8: Our definition of normalized admissible distance is
Definition 2.4: Let D be an admissible distance. Then D +
(x; y) is more direct than in [31], and the density constraints (II.2) and (II.3)
follow from the definition. In [31], we put a stricter density condition
defined by
in the definition of “admissible” normalized distance, which is, how-
D+ (x; y) = maxfmaxfD(x; z ) : C (z ) C (y)g; ever, harder to satisfy and may be too strict to be realistic. The purpose
maxfD(z; y) : C (z) C (x)gg: of this stricter density condition was to obtain a stronger “universality”
property than the present Theorem 6.3, namely, one with = 1 and
= O(1= maxfC (x); C (y)g). Nonetheless, both definitions coincide
Note that since D(x; y ) = D(y; x), also D+ (x; y ) = D+ (y; x). Fur-
if we set the length of the compressed version C (x) of x to the ultimate
compressed length K (x), the Kolmogorov complexity of x.
thermore, define
D+ (x) = maxfD+ (x; y) : y 2

g: Example 2.9: To obtain a normalized version of the Hamming dis-
tance of Example 2.2, we define hn (x; y ) = Hn (x; y )=Hn+ (x; y ). We
Definition 2.5: Let D be an admissible distance. The normalized can set
admissible distance, also called a similarity distance, d(x; y ), based on
D relative to a reference compressor C , is defined by H + (x; y) = H + (x) = (n + 2)dlog ne + 4dlog log ne + O(1)
n n
since every contemplated compressor C will satisfy C (x) = C (x),

d(x; y) = D+(x; y) : where x is x with all bits flipped (so Hn+ (x; y ) Hn+ (z; z ) for either
D (x; y)
z = x or z = y ). By (II.2), for every x, the number of y in the Ham- This number C (y jx) of bits of information in y , relative to x, can be
ming ball hn (x; y ) e is less than 2eH (x)+1 . This upper bound is viewed as the excess number of bits in the compressed version of xy
an obvious overestimate for e 1= log n. For lower values of e, the compared to the compressed version of x, and is called the amount of
upper bound is correct by the observation that the number of y ’s equals conditional compressed information.
en n
In the definition of compressor, the decompression algorithm is not
en
2nH e( )
included (unlike the case of Kolmorogov complexity, where the decom-
i=0 pressing algorithm is given by definition), but it is easy to construct one.
Given the compressed version of x in C (x) bits, we can run the com-
where H (e) = e log e+(10e) log(10e), Shannon’s entropy function. pressor on all candidate strings z —for example, in length-increasing
Then, eHn+ (x) > en log n > enH (e) since e log n > H (e). lexicographical order, until we find the compressed version of string
z0 equals the compressed version of string x. Since this string decom-
III. NORMAL COMPRESSOR presses to x we have found x = z0 . Given the compressed version of
We give axioms determining a large family of compressors that both xy in C (xy) bits, we repeat this process using strings xz until we find
include most (if not all) real-world compressors and ensure the desired the string xz1 of which the compressed version equals the compressed
properties of the NCD to be defined later. version of xy . Since the former compressed version decompresses to
xy , we have found y = z1 . By the unique decompression property we
Definition 3.1: A compressor C is normal if it satisfies, up to an find that C (y jx) is the extra number of bits we require to describe y
additive O(log n) term, with n the maximal binary length of an element apart from describing x. It is intuitively acceptable that the conditional
of
involved in the (in)equality concerned, the following axioms. compressed information C (xjy ) satisfies the triangle inequality
1) Idempotency: C (xx) = C (x), and C () = 0, where is the
empty string.
C ( x jy ) C ( x jz ) + C ( z jy ) : (III.4)
2) Monotonicity: C (xy ) C (x).
3) Symmetry: C (xy ) = C (yx). Lemma 3.3: Both (III.1) and (III.4) imply (III.2).
4) Distributivity: C (xy ) + C (z ) C (xz ) + C (yz ). Proof: (Equation (III.1) implies (III.2):) By monotonicity.
(Equation (III.4) implies (III.2):) Rewrite the terms in (III.4) ac-
Idempotency: A reasonable compressor will see exact repetitions cording to (III.3), cancel C (y ) in the left- and right-hand sides, use
and obey idempotency up to the required precision. It will also com- symmetry, and rearrange.
press the empty string to the empty string.
Monotonicity: A real compressor must have the monotonicity prop- Lemma 3.4: A normal compressor satisfies additionally subaddi-
erty, at least up to the required precision. The property is evident for tivity: C (xy ) C (x) + C (y ).
stream-based compressors, and only slightly less evident for block- Proof: Consider the special case of distributivity with z the
coding compressors. empty word so that xz = x, yz = y , and C (z ) = 0.
Symmetry: Stream-based compressors of the Lempel–Ziv family, Subadditivity: The subadditivity property is clearly also required
like gzip and pkzip, and the predictive PPM family, like PPMZ, are pos- for every viable compressor, since a compressor may use information
sibly not precisely symmetric. This is related to the stream-based prop- acquired from x to compress y . Minor imprecision may arise from the
erty: the initial file x may have regularities to which the compressor unlearning effect of crossing the border between x and y , mentioned
adapts; after crossing the border to y it must unlearn those regularities in relation to symmetry, but again this must vanish asymptotically with
and adapt to the ones of y . This process may cause some imprecision increasing length of x, y .
in symmetry that vanishes asymptotically with the length of x, y . A
compressor must be poor indeed (and will certainly not be used to any IV. BACKGROUND IN KOLMOGOROV COMPLEXITY
extent) if it does not satisfy symmetry up to the required precision.
Apart from stream-based, the other major family of compressors is Technically, the Kolmogorov complexity of x given y is the length of
block-coding based, like bzip2. They essentially analyze the full input the shortest binary program, for the reference universal prefix Turing
block by considering all rotations in obtaining the compressed version. machine, that on input y outputs x; it is denoted as K (xjy ). For precise
It is to a great extent symmetrical, and real experiments show no de- definitions, theory, and applications, see [32]. The Kolmogorov com-
parture from symmetry. plexity of x is the length of the shortest binary program with no input
Distributivity: The distributivity property is not immediately in- that outputs x; it is denoted as K (x) = K (xj), where denotes the
tuitive. In Kolmogorov complexity theory the stronger distributivity empty input. Essentially, the Kolmogorov complexity of a file is the
property length of the ultimate compressed version of the file. In [4], the infor-
mation distance E (x; y ) was introduced, defined as the length of the
C (xyz ) + C (z ) C (xz ) + C (yz ) (III.1) shortest binary program for the reference universal prefix Turing ma-
chine that, with input x, computes y , and with input y , computes x. It
holds (with K = C ). However, to prove the desired properties of NCD was shown there that, up to an additive logarithmic term,
E (x; y) = maxfK (xjy); K (yjx)g:
below, only the weaker distributivity property
C (xy) + C (z ) C (xz ) + C (yz ) (III.2) It was shown also that E (x; y ) is a metric, up to negligible violations
of the metric inequalties. Moreover, it is universal in the sense that
above is required, also for the boundary case were C = K . In prac- for every admissible distance D(x; y ) as in Definition 2.1, E (x; y )
tice, real-world compressors appear to satisfy this weaker distributivity D(x; y) up to an additive constant depending on D but not on x and
property up to the required precision. y . In [31], the normalized version of E (x; y), called the normalized
Definition 3.2: Define information distance (NID), is defined as
NID(x; y ) =
maxfK (xjy); K (yjx)g :
C (yjx) = C (xy) 0 C (x): (III.3) maxfK (x); K (y)g (IV.1)
It too is a metric, and it is universal in the sense that this single metric Case 2: Assume C(y) C(x). By symmetry C(xy) = C(yx).
minorizes up to a negligible additive error term all normalized admis- Now follow the proof of Case 1.
sible distances in the class considered in [31]. Thus, if two files (of what-
Lemma 5.2: If C is a normal compressor, then EC (x; y) satisfies
ever type) are similar (that is, close) according to the particular feature
the metric (in)equalities up to logarithmic additive precision.
described by a particular normalized admissible distance (not neces-
Proof: Only the triangular inequality is nonobvious. By (III.2),
C(xy) + C(z) C(xz) + C(yz) up to logarithmic additive preci-
sarily metric), then they are also similar (that is, close) in the sense of
the normalized information metric. This justifies calling the latter the
sion. There are six possibilities, and we verify the correctness of the
similarity metric. We stress once more that different pairs of objects
triangular inequality in turn for each of them.
Assume C(x) C(y) C(z). Then
may have different dominating features. Yet every such dominant sim-
ilarity is detected by the NID. However, this metric is based on the no-
tion of Kolmogorov complexity. Unfortunately, the Kolmogorov com-
plexity is noncomputable in the Turing sense. Approximation of the de- C(xy) 0 C(x) C(xz) 0 C(x) + C(yz) 0 C(y):
nominator of (IV.1) by a given compressor C is straightforward: it is
maxfC(x); C(y)g. The numerator is more tricky. It can be rewritten as Assume C(y) C(x) C(z). Then
f
max K(x; y) 0 K(x); K(x; y) 0 K(y)g (IV.2)
C(xy) 0 C(y) C(xz) 0 C(y) + C(yz) 0 C(x):
within logarithmic additive precision, by the additive property of
Kolmogorov complexity [32]. The term K(x; y) represents the length
of the shortest program for the pair (x; y). In compression practice
Assume C(x) C(z) C(y). Then
it is easier to deal with the concatenation xy or yx. Again, within
logarithmic precision K(x; y) = K(xy) = K(yx). Following a
C(xy) 0 C(x) C(xz) 0 C(x) + C(yz) 0 C(z):
C(z) C(x). Then
suggestion by Steven de Rooij, one can approximate (IV.2) best by
minfC(xy); C(yx)g 0 minfC(x); C(y)g. Here, and in the later Assume C(y)
experiments using the CompLearn Toolkit [9], we simply use C(xy)
rather than minfC(xy); C(yx)g. This is justified by the observation C(xy) 0 C(y) C(xz) 0 C(z) + C(yz) 0 C(y):
that block-coding-based compressors are symmetric almost by defini-
tion, and experiments with various stream-based compressors (gzip, Assume C(z) C(x) C(y). Then
PPMZ) show only small deviations from symmetry.
The result of approximating the NID using a real compressor C is C(xy) 0 C(x) C(xz) 0 C(z) + C(yz) 0 C(z):
called the normalized compression distance or NCD, formally defined
in (IV.1). The theory as developed for the Kolmogorov-complexity Assume C(z) C(y) C(x). Then
based NID in [31], may not hold for the (possibly poorly) approxi-
mating NCD. It is nonetheless the case that experiments show that the C(xy) 0 C(y) C(xz) 0 C(z) + C(yz) 0 C(z):
NCD apparently has (some) properties that make the NID so appealing.
To fill this gap between theory and practice, we develop the theory
of NCD from first principles, based on the axiomatics of Section III. Lemma 5.3: If C is a normal compressor, then
We show that the NCD is a quasi-universal similarity metric relative
to a normal reference compressor C . The theory developed in [31] is + f
EC (x; y) = max C(x); C(y) : g
the boundary case C = K , where the “quasi-universality” below has
become full “universality.” Proof: Consider a pair (x; y). The
V. COMPRESSION DISTANCE f
max C(xz) 0 C(z) : C(z) C(y)g
We define a compression distance based on a normal compressor
and show it is an admissible distance. In applying the approach, we is C(x) which is achieved for z = , the empty word, with C() = 0.
have to make do with an approximation based on a far less powerful Similarly, the
real-world reference compressor C . A compressor C approximates the
information distance E(x; y), based on Kolmogorov complexity, by f
max C(yz) 0 C(z) : C(z) C(x)g
the compression distance EC (x; y) defined as
EC (x; y) = C(xy) 0 minfC(x); C(y)g: (V.1) is C(y). Hence, the lemma.
Here, C(xy) denotes the compressed size of the concatenation of x
and y , C(x) denotes the compressed size of x, and C(y) denotes the VI. NORMALIZED COMPRESSION DISTANCE
compressed size of y . The normalized version of the admissible distance EC (x; y), the
compressor C based approximation of the normalized information dis-
Lemma 5.1: If C is a normal compressor, then EC (x; y) + O(1) is
tance (IV.1), is called the normalized compression distance or NCD
an admissible distance.
Proof: Case 1: Assume C(x) C(y). Then EC (x; y) =
C(xy) 0 C(x). Then, given x and a prefix program of length
C(xy) 0 minfC(x); C(y)g :
EC (x; y) consisting of the suffix of the C -compressed version of xy ,
NCD(x; y) =
f
max C(x); C(y) g (VI.1)
and the compressor C in O(1) bits, we can run the compressor C on

This NCD is the main concept of this work. It is the real-world version
all xz ’s, the candidate strings z in length-increasing lexicographical
of the ideal notion of normalized information distance NID in (IV.1).
order. When we find a z so that the suffix of the compressed version of
xz matches the given suffix, then z = y by the unique decompression Remark 6.1: In practice, the NCD is a nonnegative number 0
property. r 1 + representing the differences between the two files. Smaller
numbers represent more similar files. The in the upper bound is due and denominator. Although now the right-hand side
to imperfections in our compression techniques, but for most standard 1
decreases, it must still be greater than , and therefore,
01
compression algorithms one is unlikely to see an above : (in our the right-hand side remains at least as large as the left-
1
experiments, gzip and bzip2 achieved NCDs above , but PPMZ always hand side.
had NCD at most ). 1 ( )
b) NCD x; z NCD x; y ( )+ ( ) NCD y; z . By distributivity,
There is a natural interpretation to NCD(x; y ). If, say, C (y )
we have C xz( )+ ( )
C y C xy ( )+ ( ) C yz . Subtracting
( ), then we can rewrite ()

C x from both sides, rearranging, and dividing both sides
C x
()
by C z we obtain
C (xy ) 0 C (x)
NCD(x; y ) = C (xz ) 0 C (x)
C (y )
C (xyC) (0z)C (x) + C (yzC) (0z)C (y) :
:
C (z )
That is, the distance NCD(x; y ) between x and y is the improvement
due to compressing y using x as previously compressed “database,” The right-hand side does not decrease when we substitute
and compressing y from scratch, expressed as the ratio between the bit- () ()
C y for the denominator C z of the first term, since
wise length of the two compressed versions. Relative to the reference
compressor, we can define the information in x about y as C y 0 () () ()
C y C z . Therefore, the inequality stays valid under
( )
C yjx . Then, using (III.3)
this substitution, which was what we had to prove.
( ) ( )+ ( )
c) NCD y; z NCD y; x NCD x; z . By distributivity, we
( ) = 1 0 C (y)C0(yC)(yjx) :
NCD x; y
( )+ ( ) ( )+ ( )
have C yz C x C yx C xz . Subtracting C y ()
from both sides, using symmetry, rearranging, and dividing
That is, the NCD between x and y is 1 minus the ratio of the information
()
both sides by C z we obtain
C (yz ) 0 C (y )
C (xyC) (0z)C (x) + C (yzC) (0z)C (y) :
in x about y and the information in y .
Theorem 6.2: If the compressor is normal, then the NCD is a nor- C (z )
malized admissible distance satsifying the metric (in)equalities, that is,
The right-hand side does not decrease when we substitute
a similarity metric.
() ()
C y for the denominator C z of the first term, since
Proof: If the compressor is normal, then by Lemmas 5.1 and 5.3,
the NCD is a normalized admissible distance. It remains to show it () ()
C y C z . Therefore, the inequality stays valid under
satisfies the three metric (in)equalities. this substitution, which was what we had to prove.
1) By idempotency, we have NCD x; x ( )=0 . By monotonicity, we Quasi-universality: We now digress to the theory developed in [31],
( ) 0
have NCD x; y for every x, y , with inequality for y 6 x. = which formed the motivation for developing the NCD. If, instead of the
( )=
2) NCD x; y ( ) NCD y; x . The NCD is unchanged by inter- result of some real compressor, we substitute the Kolmogorov com-
changing x and y in (IV.1). plexity for the lengths of the compressed files in the NCD formula,
3) The difficult property is the triangle inequality. Without loss of the result is the NID as in (IV.1). It is universal in the following sense:
() () ()
generality, we assume C x C y C z . Since the NCD Every admissible distance expressing similarity according to some fea-
is symmetrical, there are only three triangle inequalities that can ture, that can be computed from the objects concerned, is comprised
( ) ( ) ( )
be expressed by NCD x; y , NCD x; z , NCD y; z . We verify (in the sense of minorized) by the NID. Note that every feature of the
them in turn. data gives rise to a similarity, and, conversely, every similarity can be
( )
a) NCD x; y NCD x; z ( )+ ( ) NCD z; y : By distributivity, thought of as expressing some feature: being similar in that sense. Our
the compressor itself satisfies actual practice in using the NCD falls short of this ideal theory in at
( ) + C (z) C (xz) + C (zy):
C xy
least three respects.
i) The claimed universality of the NID holds only for indefinitely
Subtracting C (x) from both sides and rewriting long sequences x, y . Once we consider strings x, y of defi-
C (xy ) 0 C (x) C (xz ) 0 C (x) + C (zy ) 0 C (z ):
nite length n, it is only universal with respect to “simple” com-
putable normalized admissible distances, where “simple” means
Dividing by C (y ) on both sides we find that they are computable by programs of length, say, logarithmic
C (xy ) 0 C (x)
C (xz) 0 C (xC) +(yC) (zy) 0 C (z) :
in n. This reflects the fact that, technically speaking, the univer-
C (y )
sality is achieved by summing the weighted contribution of all
similarity distances in the class considered with respect to the
The left-hand side is 1. objects considered. Only similarity distances of which the com-
i) Assume the right-hand side is 1. Setting C (z ) = plexity is small (which means that the weight is large), with re-
C (y )+1, and adding 1 to both the numerator and de- spect to the size of the data concerned, kick in.
nominator of the right-hand side, it can only increase ii) The Kolmogorov complexity is not computable, and it is in prin-
and draw closer to 1. Therefore, ciple impossible to compute how far off the NCD is from the
C (xy ) 0 C (x) NID. So we cannot in general know how well we are doing using
C (y ) the NCD.
C (xz) 0 C (xC) +(yC) +(zy1) 0 C (z) + 1

iii) To approximate the NCD we use standard compression pro-
grams like gzip, PPMZ, and bzip2. While better compression
= C (zxC) (0z)C (x) + C (zyC) (0z)C (y)

of a string will always approximate the Kolmogorov complexity
better, this may not be true for the NCD. Due to its arithmetic
form, subtraction, and division, it is theoretically possible that
which was what we had to prove. while all items in the formula get better compressed, the im-
1
ii) Assume the right-hand side is > . We proceed like provement is not the same for all items, and the NCD value
in the previous case, and add 1
to both numerator moves away from the NID value. In our experiments, we have
not observed this behavior in a noticable fashion. Formally, we

can state the following.
Theorem 6.3: Let d be a computable normalized admissible dis-
tance and C be a normal compressor. Then, NCD(x; y ) d(x; y )+,
where for C (x) C (y ), we have = D+ (x)=C (x) and
= (C (xjy) 0 K (xjy))=C (x), with C (xjy) according to (III.3).
Proof: Fix d, C , x, y in the statement of the theorem. Since the
NCD is symmetrical, we can, without loss of generality, let C (x)
C (y). By (III.3) and the symmetry property C (xy) = C (yx) we
have C (xjy ) C (y jx). Therefore, NCD(x; y ) = C (xjy )=C (x). Let
d(x; y) be the normalized version of the admissible distance D(x; y);
that is, d(x; y ) = D(x; y )=D+ (x; y ). Let d(x; y ) = e. By (II.2),
there are <2eD (x)+1 many (x; v ) pairs, such that d(x; v ) e. Since
d is computable, we can compute and enumerate all these pairs. The
initially fixed pair (x; y ) is an element in the list and its index takes
eD+ (+x) + 1 bits. Therefore, given x, the y can be described by at
most eD (x) + O(1) bits—its index in the list and an O(1) term ac-
counting for the lengths of the programs involved in reconstructing y
given its index in the list, and algorithms to compute functions d and C .
Since the Kolmogorov complexity gives the length of the shortest ef-
fective description, we have K (y jx) eD+ (x) + O(1). Substitution,
rewriting, and using K (xjy ) E (x; y ) D(x; y ) up to ignorable
additive terms (Section IV), yields
NCD(x; y ) = C (xjy )=C (x) e +
which was what we had to prove.

Remark 6.4: Clustering according to NCD will group sequences Fig. 1. The three possible quarter topologies for the set of leaf labels u, v, w, x.
together that are similar according to features that are not explicitly
known to us. Analysis of what the compressor actually does still may
not tell us which features that make sense to us can be expressed by in this setting, and the most natural way is to represent the relations
conglomerates of features analyzed by the compressor. This can be in the form of a dendrogram, which is customarily a directed binary
exploited to track down unknown features implicitly in classification: tree or undirected ternary tree. To construct the tree from a distance
forming automatically clusters of data and see in which cluster (if any) matrix with entries consisting of the pairwise distances between
a new candidate is placed. objects, we use a quartet method. This is a matter of choice only;
Another aspect that can be exploited is exploratory. Given that the other methods may work equally well. The distances we compute in
NCD is small for a pair x, y of specific sequences, what does this re- our experiments are often within the range 0:85 to 1:1. That is, the
ally say about the sense in which these two sequences are similar? The distinguishing features are small, and we need a sensitive method
above analysis suggests that close similarity will be due to a domi- to extract as much information contained in the distance matrix as
nating feature (that perhaps expresses a conglomerate of subfeatures). possible. For example, our experiments showed that reconstructing
Looking into these deeper causes may give feedback about the appro- a minimum spanning tree is not sensitive enough and gives poor
priateness of the realized NCD distances and may help extract more results. With increasing number of data items, the projection of the
intrinsic information about the objects, than the oblivious division into NCD matrix information into the tree representation format gets
clusters, by looking for the common features in the data clusters. increasingly distorted. A similar situation arises in using alignment
cost in genomic comparisons. Experience shows that in both cases the
hierarchical clustering methods seem to work best for small sets of
VII. CLUSTERING
data, up to 25 items, and to deteriorate for larger sets, say 40 items or
Given a set of objects, the pairwise NCDs form the entries of a dis- more. A standard solution to hierarchically cluster larger sets of data
tance matrix. This distance matrix contains the pairwise relations in is to first cluster nonhierarchically, by say multidimensional scaling
raw form. But in this format that information is not easily usable. Just of k -means, available in standard packages, for instance, Matlab, and
as the distance matrix is a reduced form of information representing the then apply hierarchical clustering on the emerging clusters.
original data set, we now need to reduce the information even further The quartet method: We consider every group of four elements
in order to achieve a cognitively acceptable format like data clusters. from our set of n elements; there are n4 such groups. From each
To extract a hierarchy of clusters from the distance matrix, we deter- group u, v , w , x we construct a tree of arity 3, which implies that the
mine a dendrogram (ternary tree) that agrees with the distance matrix tree consists of two subtrees of two leaves each. Let us call such a tree
according to a cost measure. This allows us to extract more information a quartet topology. There are three possibilities denoted i) uv jwx, ii)
from the data than just flat clustering (determining disjoint clusters in uwjvx, and iii) uxjvw , where a vertical bar divides the two pairs of
dimensional representation). leaf nodes into two disjoint subtrees (Fig. 1).
Clusters are groups of objects that are similar according to our For any given tree T and any group of four leaf labels u, v , w , x,
metric. There are various ways to cluster. Our aim is to analyze we say T is consistent with uv jwx if and only if the path from u to
data sets for which the number of clusters is not known a priori, v does not cross the path from w to x. Note that exactly one of the
and the data are not labeled. As stated in [16], conceptually simple, three possible quartet topologies for any set of four labels is consis-
hierarchical clustering is among the best known unsupervised methods tent for any given tree. We may think of a large tree as having many
is to find a full tree with a maximum value of S (T ), which is to say,

the lowest total cost.
To express the notion of computational difficulty one uses the notion
of “nondeterministic polynomial time (NP).” If a problem concerning
n objects is NP-hard this means that the best known algorithm for this
(and a wide class of significant problems) requires computation time

exponential in n. That is, it is infeasible in practice. The MQC deci-
sion problem is the following: Given n objects, let T be a tree of which
the n leaves are labeled by the objects, and let QT be the set of quartet
topologies embedded in T . Given a set of quartet topologies Q, and
an integer k , the problem is to decide whether there is a binary tree T
such that Q QT > k . In [19], it is shown that the MQC decision
problem is NP-hard. For every MQC decision problem one can define
an MQTC problem that has the same solution: give the quartet topolo-
gies in Q cost 0 and the other ones cost 1. This way the MQC decision
problem can be reduced to the MQTC decision problem, which shows
also the latter to be NP-hard. Hence, it is infeasible in practice, but we
can sometimes solve it, and always approximate it. (The reduction also
shows that the quartet problems reviewed in [19] are subsumed by our
problem.) Adapting current methods in [6] to our MQTC optimization
problem results in far too computationally intensive calculations; they
Fig. 2. An example tree consistent with quartet topology uvjwx. run many months or years on moderate-sized problems of 30 objects.
Therefore, we have designed a simple, feasible, heuristic method for
our problem based on randomization and hill-climbing. First, a random
tree with 2n 0 2 nodes is created, consisting of n leaf nodes (with
smaller quartet topologies embedded within its structure. Commonly,
the goal in the quartet method is to find (or approximate as closely as
one connecting edge) labeled with the names of the data items, and
n 0 2 nonleaf or internal nodes labeled with the lowercase letter “n”
possible) the tree that embeds the maximal number of consistent (pos-
sibly weighted) quartet topologies from a given set Q of quartet topolo-
followed by a unique integer identifier. Each internal node has exactly
gies [19] (Fig. 2). This is called the (weighted) maximum quartet con-
three connecting edges. For this tree T , we calculate the total cost of
sistency (MQC) problem.
all embedded quartet toplogies, and invert and scale this value to find
We propose a new optimization problem: the minimum quartet tree 1 of all quartet topologies,
S (T ). A tree is consistent with precisely
cost (MQTC), as follows. The cost of a quartet topology is defined 3
one for every quartet. A random tree may be consistent with about 13 of
as the sum of the distances between each pair of neighbors; that is,
the best quartet topologies—but because of dependencies, this figure
Cuv jwx = d(u; v ) + d(w; x). The total cost CT of a tree T with a set
is not precise. The initial random tree is chosen as the currently best
N of leaves (external nodes of degree 1) is defined as
known tree, and is used as the basis for further searching. We define a
CT = f j
Cuv wx : T is consistent with uvjwxg simple mutation on a tree as one of the three possible transformations.
fu;v;w;xgN 1) A leaf swap, which consists of randomly choosing two leaf
nodes and swapping them.
the sum of the costs of all its consistent quartet topologies. First, we
2) A subtree swap, which consists of randomly choosing two in-
generate a list of all possible quartet topologies for all four-tuples of
ternal nodes and swapping the subtrees rooted at those nodes.
labels under consideration. For each group of three possible quartet
3) A subtree transfer, whereby a randomly chosen subtree (pos-
topologies for a given set of four labels u, v , w , x, calculate a best
sibly a leaf) is detached and reattached in another place, main-
(minimal) cost
taining arity invariants.
m(u; v; w; x) = min f j j
Cuv wx ; Cuw vx ; Cux vw j g Each of these simple mutations keeps the number of leaf nodes and
internal nodes in the tree invariant; only the structure and placements
and a worst (maximal) cost
change. Define a full mutation as a sequence of at least one but poten-
M (u; v; w; x) = max f j j
Cuv wx ; Cuw vx ; Cux vw j g : tially many simple mutations, picked according to the following distri-
bution. First we pick the number k of simple mutations, forming a block
Summing all best quartet topologies yields the best (minimal) cost that we will perform with probability 20k . For each such simple mu-
tation, we choose one of the three types listed above with equal prob-
m = m(u; v; w; x):
ability. Finally, for each of these simple mutations, we uniformly at
fu;v;w;xgN random select leaves or internal nodes, as appropriate. Notice that trees
Conversely, summing all worst quartet topologies yields the worst which are close to the original tree (in terms of number of simple muta-
(maximal) cost tion steps in between) are examined often, while trees that are far away
from the original tree will eventually be examined, but not very fre-
M = M (u; v; w; x):
quently. In order to search for a better tree, we simply apply a full mu-
fu;v;w;xgN tation on T to arrive at T 0 , and then calculate S (T 0 ). If S (T 0 ) > S (T ),
For some distance matrices, these minimal and maximal values cannot then keep T 0 as the new best tree. Otherwise, repeat the procedure. If
be attained by actual trees; however, the score CT of every tree T will 0
S (T ) ever reaches 1, then halt, outputting the best tree. Otherwise, run
lie between these two values. In order to be able to compare tree scores until it seems no better trees are being found in a reasonable amount of
in a more uniform way, we now rescale the score linearly such that the time, in which case the approximation is complete.
worst score maps to 0, and the best score maps to 1, and term this the Note that if a tree is ever found such that S (T ) = 1, then we can
normalized tree benefit score S (T ) = (M 0 CT )=(M 0 m). Our goal stop because we can be certain that this tree is optimal, as no tree could
Fig. 3. Progress of a 60-item data set experiment over time.
have a lower cost. In fact, this perfect tree result is achieved in our Recall, that the “similarity machine” we have described consists of
artificial tree reconstruction experiment (Section VII-A) reliably in a two parts: i) extracting a distance matrix from the data, and ii) con-
few minutes. For real-world data, S (T ) reaches a maximum somewhat structing a tree from the distance matrix using our novel quartet-based
less than 1, presumably reflecting distortion of the information in the heuristic.
distance matrix data by the best possible tree representation, as noted Testing the quartet-based tree construction: We first test whether
above, or indicating getting stuck in a local optimum or a search space the quartet-based tree construction heuristic is trustworthy. We gener-
too large to find the global optimum. On many typical problems of up ated a ternary tree T with 18 leaves, using the pseudorandom number
to 40 objects, this tree search gives a tree with S (T ) 0:9 within half generator “rand” of the Ruby programming language, and derived a
an hour. For large numbers of objects, tree scoring itself can be slow metric from it by defining the distance between two nodes as follows:
(as this takes order n4 computation steps), and the space of trees is also given the length of the path from a to b, in an integer number of edges,
large, so the algorithm may slow down substantially. For larger experi- as L(a; b), let
ments, we use a C++/Ruby implementation with message passing inter-
L(a; b) +1
face (MPI, a common standard used on massively parallel computers) d(a; b) =
18
on a cluster of workstations in parallel to find trees more rapidly. We
can consider the graph mapping the achieved S (T ) score as a function except when a = b, in which case d(a; b) = 0. It is easy to verify
of the number of trees examined. Progress occurs typically in a sig- that this simple formula always gives a number between 0 and 1, and
moidal fashion toward a maximal value 1, Fig. 3. is monotonic with path length. Given only the 18 2 18 matrix of
these normalized distances, our quartet method exactly reconstructed
the original tree T represented in Fig. 4, with S (T ) = 1.
A. Three Controlled Experiments
Testing on artificial data: Given that the tree reconstruction method
With the natural data sets we use, one may have the preconception is accurate on clean consistent data, we tried whether the full proce-
(or prejudice) that, say, music by Bach should be clustered together, dure works in an acceptable manner when we know what the outcome
music by Chopin should be clustered together, and so should music by should be like. We used the “rand” pseudorandom number generator
rock stars. However, the preprocessed music files of a piece by Bach from the C programming language standard library under Linux. We
and a piece by Chopin, or the Beatles, may resemble one another more randomly generated 11 separate 1-kB blocks of data where each byte
than two different pieces by Bach—by accident or indeed by design and was equally probable and called these tags. Each tag was associated
copying. Thus, natural data sets may have ambiguous, conflicting, or with a different lower case letter of the alphabet. Next, we generated
counterintuitive outcomes. In other words, the experiments on natural 22 files of 80 kB each, by starting with a block of purely random bytes
data sets have the drawback of not having an objective clear “correct” and applying one, two, three, or four different tags on it. Applying a tag
answer that can function as a benchmark for assessing our experimental consists of ten repetitions of picking a random location in the 80-kB
outcomes, but only intuitive or traditional preconceptions. We discuss file, and overwriting that location with the globally consistent tag that
three experiments that show that our program indeed does what it is is indicated. So, for instance, to create the file referred to in the diagram
supposed to do—at least in artificial situations where we know in ad- by “a,” we start with 80 kB of random data, then pick ten places to copy
vance what the correct answer is. over this random data with the arbitrary 1-kB sequence identified
Fig. 4. The randomly generated tree that our algorithm reconstructed. S (T ) = 1.
as tag a. Similarly, to create file “ab,” we start with 80 kB of random Bergamasque,” downloaded from various repositories on the World-
data, then pick ten places to put copies of tag a, then pick ten more Wide Web; iv) Two Linux x86 ELF executables (the cp and rm com-
places to put copies of tag b (perhaps overwriting some of the a tags). mands), copied directly from the RedHat 9.0 Linux distribution; and
Because we never use more than four different tags, and therefore never v) Two compiled Java class files, generated by ourselves. The com-
place more than 40 copies of tags, we can expect that at least half of the pressor used to compute the NCD matrix was bzip2. As expected, the
data in each file is random and uncorrelated with the rest of the files. program correctly classifies each of the different types of files together
The rest of the file is correlated with other files that also contain tags in with like near like. The result is reported in Fig. 6 with S (T ) equal
common; the more tags in common, the more related the files are. The to the very high confidence value 0:984. This experiment shows the
compressor used to compute the NCD matrix was bzip2. The resulting power and universality of the method: no features of any specific do-
tree is given in Fig. 5; it can be seen that the clustering has occured main of application are used. We believe that there is no other method
exactly as we would expect. The S (T ) score is 0:905. known that can cluster data that is so heterogenous this reliably. This
Testing on heterogenous natural data: We test gross classification is borne out by the massive experiments with the method in [20].
of files based on heterogenous data of markedly different file types:
i) Four mitochondrial gene sequences, from a black bear, polar bear,
VIII. EXPERIMENTAL VALIDATION
fox, and rat obtained from the GenBank database on the World-Wide
Web; ii) Four excerpts from the novel The Zeppelin’s Passenger by We developed the CompLearn Toolkit, Section I, and performed
E. Phillips Oppenheim, obtained from the Project Gutenberg Edition on experiments in vastly different application fields to test the quality
the World-Wide Web; iii) Four MIDI files without further processing; and universality of the method. The success of the method as reported
two from Jimi Hendrix and two movements from Debussy’s “Suite below depends strongly on the judicious use of encoding of the objects
Fig. 5. Classification of artificial files with repeated 1-kB tags. Not all possiblities are included; for example, file “b” is missing. S (T ) = 0:905.
compared. Here one should use common sense on what a real world events (such as insertions, deletions, point mutations, rearrangements,
compressor can do. There are situations where our approach fails if and inversions) separating two sequences sharing a common ancestor
applied in a straightforward way. For example: comparing text files by will result in the loss of their shared information. Our method (in the
the same authors in different encodings (say, Unicode and 8-b version) form of the CompLearn Toolkit) is a fully automated software tool
is bound to fail. For the ideal similarity metric based on Kolmogorov based on such a distance to compare two genomes.
complexity as defined in [31] this does not matter at all, but for practical 1) Mammalian Evolution: In evolutionary biology, the timing and
compressors used in the experiments it will be fatal. Similarly, in the origin of the major extant placental clades (groups of organisms that
music experiments that follow we use the symbolic MIDI music file have evolved from a common ancestor) continues to fuel debate and
format rather than wave format music files. The reason is that the strings research. Here, we provide evidence by whole mitochondrial genome
resulting from straightforward discretizing the wave form files may be phylogeny for competing hypotheses in two main questions: the
too sensitive to how we discretize. Further research may overcome this grouping of the Eutherian orders, and the Therian hypothesis versus
problem. the Marsupionta hypothesis.
Eutherian orders: We demonstrate (already in [31]) that a whole
mitochondrial genome phylogeny of the Eutherians (placental mam-
A. Genomics and Phylogeny
mals) can be reconstructed automatically from unaligned complete
In recent years, as the complete genomes of various species become mitochondrial genomes by use of an early form of our compression
available, it has become possible to do whole genome phylogeny method, using standard software packages. As more genomic material
(this overcomes the problem that using different targeted parts of the has become available, the debate in biology has intensified concerning
genome, or proteins, may give different trees [36]). Traditional phylo- which two of the three main groups of placental mammals are more
genetic methods on individual genes depended on multiple alignment closely related: Primates, Ferungulates, and Rodents. In [7], the
of the related proteins and on the model of evolution of individual maximum-likelihood method of phylogeny tree reconstruction gave
amino acids. Neither of these is practically applicable to the genome evidence for the (Ferungulates, (Primates, Rodents)) grouping for
level. In absence of such models, a method which can compute the half of the proteins in the mitochondrial genomes investigated, and
shared information between two sequences is useful because biolog- (Rodents, (Ferungulates, Primates)) for the other halves of the mt
ical sequences encode information, and the occurrence of evolutionary genomes. In that experiment they aligned 12 concatenated mitochon-
Fig. 6. Classification of different file types. Tree agrees exceptionally well with NCD distance matrix: S (T ) = 0:984.
drial proteins, taken from 20 species: rat (Rattus norvegicus), house references about the Eutherian orders, and the far more general one
mouse (Mus musculus), grey seal (Halichoerus grypus), harbor seal about the main orders of the mammals (Eutheria, Metatheria, and
(Phoca vitulina), cat (Felis catus), white rhino (Ceratotherium simum), Prototheria). Note also that adding the extra species from 20 to 24 is an
horse (Equus caballus), finback whale (Balaenoptera physalus), blue addition that biologists are loath to do: both for computational reasons
whale (Balaenoptera musculus), cow (Bos taurus), gibbon (Hylobates and for fear of destabilizing a realistic phylogeny by adding even one
lar), gorilla (Gorilla gorilla), human (Homo sapiens), chimpanzee more species to the computation. Furthermore, in the last mentioned
(Pan troglodytes), pygmy chimpanzee (Pan paniscus), orangutan references we used the special-purpose genome compressor GenCom-
(Pongo pygmaeus), Sumatran orangutan (Pongo pygmaeus abelii), press to determine the distance matrix, and the standard biological
using opossum (Didelphis virginiana), wallaroo (Macropus robustus), software MOLPHY package to reconstruct the phylogeny tree from
and the platypus (Ornithorhynchus anatinus) as outgroup. In [30], [31], the distance matrix. In contrast, in this paper, we conduct a larger ex-
we used the whole mitochondrial genomes of the same 20 species, periment than before, using just the general-purpose compressor bzip2
computing the NCD distances (or a closely related distance in [30]), to obtain the distance matrix, and our new quartet tree reconstruction
using the GenCompress compressor, followed by tree reconstruction method to obtain the phylogeny tree—that is, our own CompLearn
using the neighbor joining program in the MOLPHY package [38] package [9], used without any change in all the other experiments.
to confirm the commonly believed morphology-supported hypothesis Marsupionta and Theria: The extant monophyletic divisions
(Rodents, (Primates, Ferungulates)). Repeating the experiment using of the class Mammalia are the Prototheria (monotremes: mammals
the hypercleaning method [6] of phylogeny tree reconstruction gave the that procreate using eggs), Metatheria (marsupials: mammals that
same result. Here, we repeated this experiment several times using the procreate using pouches), and Eutheria (placental mammals: mam-
CompLearn Toolkit using our new quartet method for reconstructing mals that procreate using placentas). The sister relationships between
trees, and computing the NCD with various compressors (gzip, bzip2, these groups is viewed as the most fundamental question in mam-
PPMZ), again always with the same result. These experiments are not malian evolution [21]. Phylogenetic comparison by either anatomy
reported since they are subsumed by the larger experiment of Fig. 7. or mitochondrial genome has resulted in two conflicting hypotheses:
This is a far larger experiment than the one in [30], [31], and aimed the gene-isolation-supported Marsupionta hypothesis ((Prototheria,
at testing two distinct hypotheses simultaneously: the one in the latter Metatheria), Eutheria) versus the morphology-supported Theria hy-
Fig. 7. The evolutionary tree built from complete mammalian mtDNA sequences of 24 species, using the NCD matrix of Fig. 9. We have redrawn the tree from
our output to agree better with the customary phylogeny tree format. The tree agrees exceptionally well with the NCD distance matrix: S (T ) = 0:996.
pothesis (Prototheria, (Methateria, Eutheria)), the third possibility Metatheria, and the Prototheria. These are tightly knit groups with
apparently not being held seriously by anyone. There has been a lot of relatively close NCDs. The ferungulates are a much looser group
support for either hypothesis; recent support for the Theria hypothesis with generally distant NCDs. The intergroup distances show that
was given in [21] by analyzing a large nuclear gene (M6P/IG2R), the Prototheria are furthest away from the other groups, followed by
viewed as important across the species concerned, and even more the Metatheria and the rodents. Also the fine structure of the tree is
recent support for the Marsupionta hypothesis was given in [18] by consistent with biological wisdom.
phylogenetic analysis of another sequence from the nuclear gene (18S Hierarchical versus flat clustering: This is a good place to con-
rRNA) and by the whole mitochondrial genome. trast the informativeness of hierarchical clustering with multidimen-
Experimental evidence: To test the Eutherian orders simultane- sional-scaling based clustering using the same NCD matrix, exhibited
ously with the Marsupionta-versus Theria hypothesis, we added four in Fig. 9. The entries give a good example of typical NCD values; we
animals to the above 20: Australian echidna (Tachyglossus aculeatus), truncated the number of decimals from 15 to 3 significant digits to save
brown bear (Ursus arctos), polar bear (Ursus maritimus), using the space. Note that the majority of distances bunch in the range [0:9; 1].
common carp (Cyprinus carpio) as the outgroup. Interestingly, while This is due to the regularities the compressor can perceive. The diag-
onal elements give the self-distance, which, for PPMZ, is not actually
there are many species of Eutheria and Metatheria, there are only three
0, but is off from 0 only in the third decimal. In Fig. 8, we clustered
species of now living Prototheria known: platypus, and two types of
the 24 animals using the NCD matrix by multidimensional scaling as
echidna (or spiny anteater). So our sample of the Prototheria is large. points in two-dimensional Euclidean space. In this method, the NCD
The addition of the new species might be risky in that the addition of matrix of 24 animals can be viewed as a set of distances between points
new relations is known to distort the previous phylogeny in traditional in n-dimensional Euclidean space (n 24), which we want to project
computational genomics practice. With our method, using the full into a two-dimensional Euclidean space, trying to distort the distances
genome and obtaining a single tree with a very high confidence S (T ) between the pairs as little as possible. This is akin to the problem of
value, that risk is not as great as in traditional methods obtaining projecting the surface of the earth globe on a flat two-dimensional map
ambiguous trees with bootstrap (statistic support) values on the edges. with minimal distance distortion. The main feature is the choice of the
The mitochondrial genomes of the total of 24 species we used were measure of distortion to be minimized, [16]. Let the original set of
downloaded from the GenBank database on the World-Wide Web. distances be d1 ; . . . ; dk and the projected distances be d01 ; . . . ; dk0 . In
Each is around 17 000 bases. The NCD distance matrix was computed Fig. 8, we used the distortion measure Kruskall’s stress-1, [24], which
using the compressor PPMZ. The resulting phylogeny, with an almost minimizes
maximal S (T ) score of 0:996 supports anew the currently accepted
grouping (Rodents, (Primates, Ferungulates)) of the Eutherian or- ( (d i 0 0 )2 )
di =
2
di :
ders, and additionally the Marsupionta hypothesis ((Prototheria, i k k
i
Metatheria), Eutheria), see Fig. 7. Overall, our whole-mitochondrial
NCD analysis supports the following hypothesis: Kruskall’s stress-1 equal 0 means no distortion, and the worst value is
Mammalia at most 1 (unless you have a really bad projection). In the projection of
( (primates, ferungulates)(rodents; (Metatheria, Prototheria))) the NCD matrix according to our quartet method, one minimizes the
more subtle distortion S (T ) measure, where 1 means perfect represen-
Eutheria tation of the relative relations between every 4-tuple, and 0 means min-
which indicates that the rodents, and the branch leading to the imal representation. Therefore, we should compare distortion Kruskall
Metatheria and Prototheria, split off early from the branch that led stress-1 with 1 0 S (T ). Fig. 7 has a very good 1 0 S (T ) = 0:04
to the primates and ferungulates. Inspection of the distance matrix and Fig. 8 has a poor Kruskal stress 0:389. Assuming that the compar-
shows that the primates are very close together, as are the rodents, the ison is significant for small values (close to perfect projection), we find
Fig. 8. Multidimensional clustering of same NCD matrix (Fig. 9) as used for Fig. 7. Kruskal’s stress-1 = 0:389.
Fig. 9. Distance matrix of pairwise NCD. For display purpose, we have truncated the original entries from 15- to 3-decimal precision.
that the multidimensional scaling of this experiment’s NCD matrix is from The Universal Virus Database of the International Committee
formally inferior to that of the quartet tree. This conclusion formally on Taxonomy of Viruses, available on the World-Wide Web. The
justifies the impression conveyed by the figures on visual inspection. SARS virus was downloaded from Canada’s Michael Smith Genome
2) SARS Virus: In another experiment, we clustered the SARS Sciences Centre which had the first public SARS Coronavirus draft
virus after its sequenced genome was made publicly available, in rela- whole genome assembly available for download (SARS TOR2 draft
tion to potential similar virii. The 15 virus genomes were downloaded genome assembly 120403). The NCD distance matrix was com-
Fig. 10. SARS virus among other virii. Legend: AvianAdeno1CELO.inp: Fowl adenovirus 1; AvianIB1.inp: Avian infectious bronchitis virus (strain
Beaudette US); AvianIB2.inp: Avian infectious bronchitis virus (strain Beaudette CK); BovineAdeno3.inp: Bovine adenovirus 3; DuckAdeno1.inp: Duck
adenovirus 1; HumanAdeno40.inp: Human adenovirus type 40; HumanCorona1.inp: Human coronavirus 229E; MeaslesMora.inp: Measles virus strain Moraten;
MeaslesSch.inp: Measles virus strain Schwarz; MurineHep11.inp: Murine hepatitis virus strain ML-11; MurineHep2.inp: Murine hepatitis virus strain 2;
PRD1.inp: Enterobacteria phage PRD1; RatSialCorona.inp: Rat sialodacryoadenitis coronavirus; SARS.inp: SARS TOR2v120403; SIRV1.inp: Sulfolobus virus
SIRV-1; SIRV2.inp: Sulfolobus virus SIRV-2. S (T ) = 0:988.
Fig. 11. Dendrogram of mitochondrial genomes of fungi using NCD. This represents the distance matrix precisely with S (T ) = 0:999.
puted using the compressor bzip2. The relations in Fig. 10 are very (all yeasts), and two filamentous ascomycetes Hypocrea jecorina and
similar to the definitive tree based on medical-macrobio-genomics Verticillium lecanii. The NCD distance matrix was computed using the
analysis, appearing later in the New England Journal of Medicine, compressor PPMZ. The resulting tree is depicted in Fig. 11. The Inter-
[25]. We depicted the figure in the ternary tree style, rather than the pretation of the fungi researchers is “the tree clearly clustered the as-
genomics-dendrogram style, since the former is more precise for comycetous yeasts versus the two filamentous ascomycetes, thus sup-
visual inspection of proximity relations. porting the current hypothesis on their classification (for example, see
3) Analysis of Mitochondrial Genomes of Fungi: As a pilot for ap- [26]). Interestingly, in a recent treatment of the Saccharomycetaceae,
plications of the CompLearn Toolkit in fungi genomics reasearch, the S. servazii, S. castellii, and C. glabrata were all proposed to belong
group of T. Boekhout, E. Kuramae, V. Robert, of the Fungal Biodiver- to genera different from Saccharomyces, and this is supported by the
sity Center, Royal Netherlands Academy of Sciences, supplied us with topology of our tree as well ([27]).”
the mitochondrial genomes of Candida glabrata, Pichia canadensis, To compare the veracity of the NCD clustering with a more fea-
Saccharomyces cerevisiae, S. castellii, S. servazzii, Yarrowia lipolytica ture-based clustering, we also calculated the pairwise distances as fol-
Fig. 12. Dendrogram of mitochondrial genomes of fungi using block frequencies. This represents the distance matrix precisely with S (T ) = 0:999.
Fig. 13. Clustering of Native-American, Native-African, and Native-European languages. S (T ) = 0:928.
lows. Each file is converted to a 4096-dimensional vector by consid- C. Literature

ering the frequency of all (overlapping) 6-b contiguous blocks. The The texts used in this experiment were downloaded from the
l2 -distance (Euclidean distance) is calculated between each pair of files
World-Wide Web in original Cyrillic-lettered Russian and in Latin-let-
by taking the square root of the sum of the squares of the compo- tered English by L. Avanasiev (Moldavian M.Sc. student at the
nent-wise differences. These distances are arranged into a distance ma- University of Amsterdam). The compressor used to compute the NCD
trix and linearly scaled to fit the range [0; 1:0]. Finally, we ran the clus- matrix was bzip2. We clustered Russian literature in the original
tering routine on this distance matrix. The results are in Fig. 12. As (Cyrillic) by Gogol, Dostojevski, Tolstoy, Bulgakov,Tsjechov, with
seen by comparing with the NCD-based Fig. 11 there are apparent mis- three or four different texts per author. Our purpose was to see whether
placements when using the Euclidean distance in this way. Thus, in this the clustering is sensitive enough, and the authors distinctive enough,
simple experiment, the NCD performed better, that is, agreed more pre- to result in clustering by author. In Fig. 14 we see a perfect clustering.
cisely with accepted biological knowledge. Considering the English translations of the same texts, in Fig. 15,
we see errors in the clustering. Inspection shows that the clustering
B. Language Trees is now partially based on the translator. It appears that the translator
Our method improves the results of [1], using a linguistic corpus of superimposes his characteristics on the texts, partially suppressing
“The Universal Declaration of Human Rights (UDoHR)” [35] in 52 the characteristics of the original authors. In other experiments, not
languages. Previously, [1] used an asymmetric measure based on rela- reported here, we separated authors by gender and by period.
tive entropy, and the full matrix of the pair-wise distances between all
52 languages, to build a language classification tree. This experiment D. Music
was repeated (resulting in a somewhat better tree) using the compres- The amount of digitized music available on the internet has grown
sion method in [31] using standard biological software packages to dramatically in recent years, both in the public domain and on commer-
construct the phylogeny. We have redone this experiment, and done cial sites. Napster and its clones are prime examples. Websites offering
new experiments, using the CompLearn Toolkit. Here, we report on an musical content in some form or other (MP3, MIDI,…) need a way to
experiment to separate radically different language families. We down- organize their wealth of material; they need to somehow classify their
loaded the language versions of the UDoHR text in English, Spanish, files according to musical genres and subgenres, putting similar pieces
Dutch, German (Native-European), Pemba, Dendi, Ndbele, Kicongo, together. The purpose of such organization is to enable users to nav-
Somali, Rundi, Ditammari, Dagaare (Native African), Chikasaw, igate to pieces of music they already know and like, but also to give
Perhupecha, Mazahua, Zapoteco (Native-American), and did not pre- them advice and recommendations (“If you like this, you might also
process them except for removing initial identifying information. We like…”). Currently, such organization is mostly done manually by hu-
used a Lempel–Ziv-type compressor gzip to compress text sequences mans, but some recent research has been looking into the possibilities
of sizes not exceeding the length of the sliding window gzip uses of automating music classification.
(32 kB), and compute the NCD for each pair of language sequences. Initially, we downloaded 36 separate MIDI files selected from a
Subsequently, we clustered the result. We show the outcome of one range of classical composers, as well as some popular music. The
of the experiments in Fig. 13. Note that three groups are correctly files were downloaded from several different MIDI databases on the
clustered, and that even the subclusters of the European languages are World-Wide Web. The identifying information, composer, title, and
correct (English is grouped with the Romance languages because it so on, was stripped from the files (otherwise this may give a marginal
contains up to 40% admixture of words from Latin origin). advantage to the compressor to identify composers). Each of these
Fig. 14. Clustering of Russian writers. Legend: I.S. Turgenev, 1818–1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky
1821–1881 [Crime and Punishment, The Gambler, The Idiot; Poor Folk]; L.N. Tolstoy 1828–1910 [Anna Karenina, The Cossacks, Youth, War and Piece]; N.V.
Gogol 1809–1852 [Dead Souls, Taras Bulba, The Mysterious Portrait, How the Two Ivans Quarrelled]; M. Bulgakov 1891–1940 [The Master and Margarita,
The Fatefull Eggs, The Heart of a Dog]. S (T ) = 0:949.
Fig. 15. Clustering of Russian writers translated in English. The translator is given in brackets after the titles of the texts. Legend: I.S. Turgenev, 1818–1883 [Father
and Sons (R. Hare), Rudin (Garnett, C. Black), On the Eve (Garnett, C. Black), A House of Gentlefolk (Garnett, C. Black)]; F. Dostoyevsky 1821–1881 [Crime and
Punishment (Garnett, C. Black), The Gambler (C.J. Hogarth), The Idiot (E. Martin); Poor Folk (C.J. Hogarth)]; L.N. Tolstoy 1828–1910 [Anna Karenina (Garnett,
C. Black), The Cossacks (L. and M. Aylmer), Youth (C.J. Hogarth), War and Piece (L. and M. Aylmer)]; N.V. Gogol 1809–1852 [Dead Souls (C.J. Hogarth), Taras

Bulba ( G. Tolstoy, 1860, B.C. Baskerville), The Mysterious Portrait and How the Two Ivans Quarrelled ( I.F. Hapgood]; M. Bulgakov 1891–1940 [The Master
and Margarita (R. Pevear, L. Volokhonsky), The Fatefull Eggs (K. Gook-Horujy), The Heart of a Dog (M. Glenny)]. S (T ) = 0:953.
files was run through a preprocessor to extract just MIDI Note-On and a time step and the end of a track. This file is then used as input to the
Note-Off events. These events were then converted to a player-piano compression stage for distance matrix calculation and subsequent tree
style representation, with time quantized in 0.05-s intervals. All search. To check whether any important feature of the music was lost
instrument indicators, MIDI control signals, and tempo variations during preprocessing, we played it back from the preprocessed files to
were ignored. For each track in the MIDI file, we calculate two compare it with the original. To the authors the pieces sounded almost
quantities: An average volume and a modal note. The average volume unchanged. The compressor used to compute the NCD matrix of the
is calculated by averaging the volume (MIDI note velocity) of all notes genres tree, Fig. 16, and that of 12-piece music set, Fig. 17 is bzip2.
in the track. The modal note is defined to be the note pitch that sounds For the full range of the music experiments see [10].
most often in that track. If this is not unique, then the lowest such note Before testing whether our program can see the distinctions between
is chosen. The modal note is used as a key-invariant reference point various classical composers, we first show that it can distinguish be-
from which to represent all notes. It is denoted by 0, higher notes are tween three broader musical genres: classical music, rock, and jazz.
denoted by positive numbers, and lower notes are denoted by negative This may be easier than making distinctions “within” classical music.
numbers. A value of 1 indicates a half-step above the modal note, All musical pieces we used are listed in the tables in [10]. For the genre-
and a value of 02 indicates a whole-step below the modal note. The experiment we used, 12 classical pieces consisting of Bach, Chopin,
tracks are sorted according to decreasing average volume, and then and Debussy, 12 jazz pieces, and 12 rock pieces. The tree (Fig. 16)
output in succession. For each track, we iterate through each time that our program came up with has S (T ) = 0:858. The discrimi-
sample in order, outputting a single signed 8-b value for each currently nation between the three genres is reasonable but not perfect. Since
sounding note. Two special values are reserved to represent the end of S (T ) = 0:858, a fairly low value, the resulting tree does not represent
Fig. 16. Output for the 36 pieces from three music genres. Legend: 12 Jazz: John Coltrane [“Blue Trane,” “Giant Steps,” “Lazy Bird,” “Impressions”]; Miles
Davis [“Milestones,” “Seven Steps to Heaven,” “Solar,” “So What”]; George Gershwin [“Summertime”]; Dizzy Gillespie [“Night in Tunisia”]; Thelonious Monk
[“Round Midnight”]; Charlie Parker [“Yardbird Suite”]; 12 Rock and Pop: The Beatles [“Eleanor Rigby,” “Michelle”]; Eric Clapton [“Cocaine,” “Layla”]; Dire
Straits [“Money for Nothing”]; Led Zeppelin [“Stairway to Heaven”]; Metallica [“One”]; Jimi Hendrix [“Hey Joe,” ”Voodoo Chile”]; The Police [“Every Breath
You Take,” “Message in a Bottle”] Rush [“Yyz”]; 12 Classic: see caption for Fig. 17. S (T ) = 0:858.
the NCD distance matrix very well. Presumably, the information in the a reason that at this point is hidden from us. In fact, running this ex-
NCD distance matrix cannot be represented by a dendrogram of high periment with different compressors and termination conditions con-
S (T ) score. This appears to be a common problem with large (> 25 sistently displayed this anomaly. The small set encompasses the four
or so) natural data sets. Another reason may be that the program termi- movements from Debussy’s “Suite Bergamasque,” four movements of
nated, while trapped in a local optimum. We repeated the experiment book 2 of Bach’s “Wohltemperierte Klavier,” and four preludes from
many times with almost the same results, so that does not appear to Chopin’s “Opus 28.” As one can see in Fig. 17, our program does
be the case. The 11-item subtree rooted at n4 contains 10 of the 12 a pretty good job at clustering these pieces. The S (T ) score is also
jazz pieces, together with a piece of Bach’s “Wohltemporierte Klavier high: 0:968. The four Debussy movements form one cluster, as do the
(WTK).” The other two jazz pieces, Miles Davis’ “So What,” and John four Bach pieces. The only imperfection in the tree, judged by what
Coltrane’s “Giant Steps” are placed elsewhere in the tree, perhaps ac- one would intuitively expect, is that Chopin’s Prélude no. 15 lies a bit
cording to some kinship that now escapes us (but may be identified closer to Bach than to the other three Chopin pieces. This Prélude no
by closer studying of the objects concerned). Of the 12 rock pieces, 15, in fact, consistently forms an odd-one-out in our other experiments
10 are placed in the 12-item subtree rooted at n29, together with a as well. This is an example of pure data mining, since there is some
piece of Bach’s “WTK,” and Coltrane’s “Giant Steps,” while Hendrix’s musical truth to this, as no. 15 is perceived as by far the most eccentric
“Voodoo Chile” and Rush “Yyz” is further away. Of the 12 classical among the 24 Préludes of Chopin’s opus 28.
pieces, 10 are in the 13-item subtrees rooted at the branch n8, n13, n6,
n2, n7, together with Hendrix’s “Voodoo Chile,” Rush’s “Yyz,” and
E. Optical Character Recognition
Miles Davis’ “So What.” Surprisingly, two of the four Bach “WTK”
pieces are placed elsewhere. Yet we perceive the four Bach pieces to Can we also cluster two-dimensional images? Because our method
be very close, both structurally and melodically (as they all come from appears focused on strings this is not straightforward. It turns out that
the monothematic “Wohltemporierte Klavier”). But the program finds scanning a picture in raster row-major order retains enough regularityin
Fig. 17. Output for the 12-piece set. Legend: J.S. Bach [“Wohltemperierte Klavier II: Preludes and Fugues 1,2”— BachWTK2{F,P}{1,2}]; Chopin [“Préludes
op. 28: 1, 15, 22, 24” —ChopPrel{1,15,22,24}]; Debussy [“Suite Bergamasque,” four movements—DebusBerg{1,2,3,4}]. S (T ) = 0:968.
lines are added at the end of each line. Each character is 128 2 128
pixels. The NCD matrix was computed using the compressor PPMZ.
Fig. 19 shows the clusters obtained. There are 10 of each digit “4,” “5,”
“6,” making a total of 30 items in this experiment. All but one of the
4’s are put in the subtree rooted at n1, all but one of the 5’s are put
in the subtree rooted at n4, and all 6’s are put in the subtree rooted at
n3. The remaining 4 and 5 are in the branch n23, n13 joining n6 and
n3. So 28 items out of 30 are clustered correctly, that is, 93%. In this
experiment, we used only three digits. Using the full set of decimal
digits means that too many objects are involved, resulting in a lower
clustering accuracy. However, we can use the NCD as an oblivious fea-
ture-extraction technique to convert generic objects into finite-dimen-
sional vectors. We have used this technique to train a support vector
machine (SVM) based OCR system to classify handwritten digits by
extracting 80 distinct, ordered NCD features from each input image.
In this initial stage of ongoing research, by our oblivious method of
computing the NCDs to use in the SVM classifier, we achieved a hand-
written single decimal digit recognition accuracy of 87%. The current
state-of-the-art for this problem, after half a century of interactive fea-
ture-driven classification research, is in the upper 90% level [34], [40].
All experiments are benchmarked on the standard NIST Special Data
Fig. 18. Images of handwritten digits used for optical character recognition Base 19. Using the NCD for general classification by compression is
(OCR}.
the subject of a future paper.
both dimensions for the compressor to grasp. A simple task along these F. Astronomy
lines is to cluster handwritten characters. The handwritten characters in As a proof of principle, we clustered data from unknown objects, for
Fig. 18 were downloaded from the NIST Special Data Base 19 (optical example objects that are far away. In [3], observations of the micro-
character recognition (OCR) database) on the World-Wide Web (cur- quasar GRS 1915 + 105 made with the Rossi X-ray Timing Eplorer
rently only available for purchase, at a recent check). Each file in the were analyzed. The interest in this microquasar stems from the fact
data directory contains one digit image, either a four, five, or six. Each that it was the first galactic object to show a certain behavior (super-
pixel is a single character: “#” for a black pixel, “1” for white. New luminal expansion in radio observations). Photonometric observation
Fig. 19. Clustering of the OCR images. S (T ) = 0:901.
data from X-ray telescopes were divided into short time segments (usu- observations are essentially time series, and our aim was experimenting
ally in the order of 1 min), and these segments have been classified with our method as a pilot to more extensive joint research. Here, the
into a bewildering array of 15 different modes after considerable ef- task was to see whether the clustering would agree with the classifi-
fort. Briefly, spectrum hardness ratios (roughly, “color”) and photon cation above. The NCD matrix was computed using the compressor
count sequences were used to classify a given interval into categories PPMZ. The results are in Fig. 20. We clustered 12 objects, consisting
of variability modes. From this analysis, the extremely complex vari- of three intervals from four different categories denoted as , , , in
ability of this source was reduced to transitions between three basic [3, Table I]. In Fig. 20, we denote the categories by the corresponding
states, which, interpreted in astronomical terms, gives rise to an expla- Roman letters D, G, P, and T, respectively. The resulting tree groups
nation of this peculiar source in standard black-hole theory. The data these different modes together in a way that is consistent with the clas-
we used in this experiment were made available to us by M. Klein Wolt sification by experts for these observations. The oblivious compression
(coauthor of [3]) and T. Maccarone, both researchers at the Astronom- clustering corresponds precisely to the laborious feature-driven classi-
ical Institute “Anton Pannekoek” at the University of Amsterdam. The fication in [3]. Further work on clustering of (possibly heterogenous)
Fig. 20. Sixteen observation intervals of GRS 1915+105 from four classes. The initial capital letter indicates the class corresponding to Greek lower case letters
in [3]. The remaining letters and digits identify the particular observation interval in terms of finer features and identity. The T -cluster is top left, the P -cluster is
bottom left, the G-cluster is to the right, and the D -cluster in the middle. This tree almost exactly represents the underlying NCD distance matrix: S (T ) = 0:994.
time series and anomaly detection, using the new compression method, ACKNOWLEDGMENT
was recently done on a massive scale in [20].
The authors wish to thank Loredana Afanasiev, Graduate School of
Logic, University of Amsterdam; Teun Boekhout, Eiko Kuramae, Vin-
IX. CONCLUSION cent Robert, Fungal Biodiversity Center, Royal Netherlands Academy
To interpret what the NCD is doing, and to explain its remarkable of Sciences; Marc Klein Wolt, Thomas Maccarone, Astronomical Insti-
accuracy and robustness across application fields and compressors, the tute “Anton Pannekoek,” University of Amsterdam; Evgeny Verbitskiy,
intuition is that the NCD minorizes all similarity distances based on Philips Research; Steven de Rooij, Ronald de Wolf, CWI; the referees
features that are captured by the reference compressor involved. Such and the editors, for suggestions, comments, help with experiments, and
features must be relatively simple in the sense that they are expressed data; Jorma Rissanen and Boris Ryabko for discussions, John Lang-
by an aspect that the compressor analyzes (for example, frequencies, ford for suggestions, Tzu-Kuo Huang for pointing out some typos and
matches, repeats). Certain sophisticated features may well be express- simplifications, and Teemu Roos and Henry Tirri for implementing a
ible as combinations of such simple features, and are therefore them- visualization of the clustering process.
selves simple features in this sense. The extensive experimenting above
shows that even elusive features are captured.
A potential application of our nonfeature (or rather, many-unknown- REFERENCES
feature) approach is exploratory. Presented with data for which the fea- [1] D. Benedetto, E. Caglioti, and V. Loreto, “Language trees and zipping,”
tures are as yet unknown, certain dominant features governing simi- Phys. Rev. Lett., vol. 88, no. 4, pp. 048702-1–048702-4, 2002.
larity are automatically discovered by the NCD. Examining the data un- [2] P. Ball, “Algorithm makes tongue tree,” Nature, Science update, Jan. 22,
derlying the clusters may yield this hitherto unknown dominant feature. 2002.
[3] T. Belloni, M. Klein-Wolt, M. Méndez, M. van der Klis, and J. van
Our experiments indicate that the NCD has application in two new Paradijs, “A model-independent analysis of the variability of GRS
areas of support vector machine (SVM) based learning. First, we find 1915 + 105,” Astron. Astrophys., vol. 355, pp. 271–290, 2000.
that the complemented NCD (10NCD) is useful as a kernel for generic [4] C. H. Bennett, P. Gács, M. Li, P. M. B. Vitányi, and W. Zurek, “Informa-
objects in SVM learning. Secondly, we can use the normal NCD as a tion distance,” IEEE Trans. Inf. Theory, vol. 44, no. 4, pp. 1407–1423,
feature-extraction technique to convert generic objects into finite-di- Jul. 1998.
mensional vectors, see the last paragraph of Section VIII-E. In effect, [5] C. H. Bennett, M. Li, and B. Ma, “Chain letters and evolutionary histo-
ries,” Scient. Amer., pp. 76–81, Jun. 2003.
our similarity engine aims at the ideal of a perfect data mining process,
[6] D. Bryant, V. Berry, P. Kearney, M. Li, T. Jiang, T. Wareham, and H.
discovering unknown features in which the data can be similar. This Zhang, “A practical algorithm for recovering the best supported edges
is the subject of ongoing joint research in genomics of fungi, clinical of an evolutionary tree,” in Proc. 11th ACM-SIAM Symp. Discrete Algo-
molecular genetics, and radio astronomy. rithms, San Francisco, CA, Jan. 9–11, 2000, pp. 287–296.
[7] Y. Cao, A. Janke, P. J. Waddell, M. Westerman, O. Takenaka, S. Murata, [30] M. Li and P. M. B. Vitányi, “Algorithmic complexity,” in International
N. Okada, S. Pääbo, and M. Hasegawa, “Conflict among individual mi- Encyclopedia of the Social & Behavioral Sciences, N. J. Smelser and P.
tochondrial proteins in resolving the phylogeny of eutherian orders,” J. B. Baltes, Eds. Oxford, U.K.: Pergamon, 2001/2002, pp. 376–382.
Mol. Evol., vol. 47, pp. 307–322, 1998. [31] M. Li, X. Chen, X. Li, B. Ma, and P. M. B. Vitányi, “The similarity
[8] X. Chen, B. Francia, M. Li, B. McKinnon, and A. Seker, “Shared infor- metric,” IEEE Trans. Inf. Theory, vol. 50, no. 12, pp. 3250–3264, Dec.
mation and program plagiarism detection,” IEEE Trans. Inf. Theory, vol. 2004.
50, no. 7, pp. 1545–1551, Jul. 2004. [32] M. Li and P. M. B. Vitányi, An Introduction to Kolmogorov Complexity
[9] R. Cilibrasi. (2003) The CompLearn Toolkit. [Online]. Available: and Its Applications, 2nd ed. New York: Springer-Verlag, 1997.
http://complearn.sourceforge.net/ [33] A. Londei, V. Loreto, and M. O. Belardinelli, “Music style and author-
[10] R. Cilibrasi, P. M. B. Vitányi, and R. de Wolf, “Algorithmic clustering ship categorization by informative compressors,” in Proc. 5th Triannual
of music based on string compression,” Comp. Music J., vol. 28, no. 4, Conf. European Society for the Cognitive Sciences of Music (ESCOM),
pp. 49–67, 2004. Hannover, Germany, Sep. 8–13, 2003, pp. 200–203.
[11] G. Cormode, M. Paterson, S. Sahinalp, and U. Vishkin, “Communication [34] L. S. Oliveira, R. Sabourin, F. Bortolozzi, and C. Y. Suen, “Automatic
complexity of document exchange,” in Proc. 11th ACM-SIAM Symp. recognition of handwritten numerical strings: A recognition and verifi-
Discrete Algorithms, San Francisco, CA, Jan. 9–11, 2000, pp. 197–206. cation strategy,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 11,
[12] T. M. Cover and J. A. Thomas, Elements of Information Theory. New pp. 1438–1454, Nov. 2002.
York: Wiley, 1991. [35] United Nations General Assembly Resolution 217 A (III) of 10 De-
[13] W. Chai and B. Vercoe, “Folk music classification using hidden Markov cember 1948. Universal Declaration of Human Rights. [Online]. Avail-
model,” in Proc. Int. Conf. Artificial Intelligence, Las Vegas, NV, Jun. able: http://www.un.org/Overview/rights.html
2001. [36] A. Rokas, B. L. Williams, N. King, and S. B. Carroll, “Genome-scale
[14] M. Cooper and J. Foote, “Automatic music summarization via similarity approaches to resolving incongruence in molecular phylogenies,” Na-
analysis,” in Proc. IRCAM, Paris, France, 2002, pp. 81–85. ture, vol. 425, pp. 798–804, Oct. 2003.
[15] R. Dannenberg, B. Thom, and D. Watson, “A machine learning approach [37] D. Salomon, Data Compression. New York: Springer-Verlag, 1997.
to musical style recognition,” in Proc. Int. Computer Music Conf., Thes- [38] N. Saitou and M. Nei, “The neighbor-joining method: A new method for
saloniki, Greece, Sep. 25–30, 1997, pp. 344–347. reconstructing phylogenetic trees,” Mol. Biol. Evol., vol. 4, pp. 406–425,
[16] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. 1987.
New York: Wiley Interscience, 2001. [39] P. Scott. (2001) Music Classification Using Neural Networks. [Online].
[17] M. Grimaldi, A. Kokaram, and P. Cunningham. (2002) Classi- Available: http://www.stanford.edu/class/ee373a/musicclassification.
fying Music by Genre Using the Wavelet Packet Transform and a pdf
Round-Robin Ensemble. Trinity College, Dublin, U.K.. [Online]. Avail- [40] Ø. D. Trier, A. K. Jain, and T. Taxt, “Feature extraction methods for char-
able: http://www.cs.tcd.ie/publications/tech-reports/reports.02/TCD- acter recognition—A survey,” Patt. Recogn., vol. 29, no. 4, pp. 641–662,
CS-2002-64.pdf 1996.
[18] A. Janke, O. Magnell, G. Wieczorek, M. Westerman, and U. Arnason, [41] P. N. Yianilos. (1991) Normalized Forms for Two Common Metrics.
“Phylogenetic analysis of 18S rRNA and the mitochondrial genomes NEC Res. Inst.. [Online]. Available: http://www.pnylab.com/pny/
of wombat, vombatus ursinus, and the spiny anteater, tachyglossus ace- [42] A. C.-C. Yang, C.-K. Peng, H.-W. Yien, and A. L. Goldberger, “Infor-
laetus: Increased support for the marsupionta hypothesis,” J. Mol. Evol., mation categorization approach to literary authorship disputes,” Physica
vol. 54, no. 1, pp. 71–80, 2002. A, vol. 329, pp. 473–483, 2003.
[19] T. Jiang, P. Kearney, and M. Li, “A polynomial time approximation [43] G. Tzanetakis and P. Cook, “Music genre classification of audio signals,”
scheme for inferring evolutionary trees from quartet topologies and its IEEE Trans. Speech and Audio Process., vol. 10, no. 5, pp. 293–302, Jul.
application,” SIAM J. Comput., vol. 30, no. 6, pp. 1942–1961, 2001. 2002.
[20] E. Keogh, S. Lonardi, and C. A. Rtanamahatana, “Toward parameter- [44] S. Wehner. (2004) Analyzing Network Traffic and Worms Using Com-
free data mining,” in Proc. 10th ACM SIGKDD Int. Conf. Knowledge pression. CWI. [Online]. Available: http://homepages.cwi.nl/~wehner/
Discovery and Data Mining, Seattle, WA, Aug. 22–25, 2004, pp. worms/
206–215.
[21] J. K. Killian, T. R. Buckley, N. Steward, B. L. Munday, and R. L. Jirtle,
“Marsupials and eutherians reunited: Genetic evidence for the theria hy-
pothesis of mammalian evolution,” Mammalian Genome, vol. 12, pp.
513–517, 2001.
[22] M. Koppel, S. Argamon, and A. R. Shimoni, “Automatically catago-
rizing written texts by author gender,” Literary and Linguistic Comput.,
vol. 17, no. 4, pp. 401–412, 2002.
[23] A. Kraskov, H. Stögbauer, R. G. Adrsejak, and P. Grassberger. (2003)
Hierarchical Clustering Based on Mutual Information. [Online]. Avail-
able: http://arxiv.org/abs/q-bio/0311039
[24] J. B. Kruskal, “Nonmetric multidimensional scaling: A numerical
method,” Psychometrika, vol. 29, pp. 115–129, 1964.
[25] T. G. Ksiazek et al., “A novel coronavirus associated with severe acute
respiratory syndrome,” New England J. Med., vol. 349, p. 709, Aug. 14,
2003.
[26] C. P. Kurtzman and J. Sugiyama, “Ascomycetous yeasts and yeast-like
taxa,” in The Mycota VII, Systemtics and Evolution. Berlin, Germany:
Springer-Verlag, 2001, pt. A, pp. 179–200.
[27] C. P. Kurtzman, “Phylogenetic circumscription of saccharomyces,
kluyveromyces and other members of the saccharomycetaceaea, and
the proposal of the new genera lachnacea, nakaseomyces, naumovia,
vanderwaltozyma and zygotorulaspora,” FEMS Yeast Res., vol. 4, pp.
233–245, 2003.
[28] P. S. Laplace, A Philosophical Essay on Probabilities. New York:
Dover, 1951. English translation of the 1918 French original version.
[29] M. Li, J. H. Badger, X. Chen, S. Kwong, P. Kearney, and H. Zhang,
“An information-based sequence distance and its application to whole
mitochondrial genome phylogeny,” Bioinformatics, vol. 17, no. 2, pp.
149–154, 2001.

Clustering by Compression: Abstract-We Present A New Method For Clustering Based On Compression

Uploaded by

Document Informationclick to expand document informationTentang Cluster

Document Informationclick to expand document information

Copyright:

Available Formats

Clustering by Compression: Abstract-We Present A New Method For Clustering Based On Compression

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Clustering by Compression: Abstract-We Present A New Method For Clustering Based On Compression

Uploaded by

Copyright:

Available Formats

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 51, NO.

4, APRIL 2005 1523

0018-9448/$20.00 © 2005 IEEE

jfy : d(x; y)  egj < 2

H (x; y) = 2 log n + 4 log log n + d log n + O(1)

as the length of a preﬁx code word (preﬁx program) to compute x from 1 20 (

y and vice versa. Then, by the Kraft inequality, y y:d(x;y )e

D+ (x) = maxfD+ (x; y) : y 2

since every contemplated compressor C will satisfy C (x) = C (x),

and the compressor C in O(1) bits, we can run the compressor C on

( ), then we can rewrite ()

 C (xz) 0 C (xC) +(yC) +(zy1) 0 C (z) + 1

= C (zxC) (0z)C (x) + C (zyC) (0z)C (y)

not observed this behavior in a noticable fashion. Formally, we

NCD(x; y ) = C (xjy )=C (x)  e + 

which was what we had to prove.

is to ﬁnd a full tree with a maximum value of S (T ), which is to say,

(and a wide class of signiﬁcant problems) requires computation time

Fig. 3. Progress of a 60-item data set experiment over time.

Fig. 4. The randomly generated tree that our algorithm reconstructed. S (T ) = 1.

Fig. 13. Clustering of Native-American, Native-African, and Native-European languages. S (T ) = 0:928.

lows. Each ﬁle is converted to a 4096-dimensional vector by consid- C. Literature

Fig. 19. Clustering of the OCR images. S (T ) = 0:901.

You might also like

jfy : d(x; y) egj < 2

as the length of a preﬁx code word (preﬁx program) to compute x from 1 20 (

y and vice versa. Then, by the Kraft inequality, y y:d(x;y )e

C (xz) 0 C (xC) +(yC) +(zy1) 0 C (z) + 1

NCD(x; y ) = C (xjy )=C (x) e +