The Google Similarity Distance
Rudi L. Cilibrasi and Paul M.B. Vitányi
Abstract— Words and phrases acquire meaning from the
way they are used in society, from their relative semantics
to other words and phrases. For computers the equivalent
of ‘society’ is ‘database,’ and the equivalent of ‘use’ is
‘way to search the database.’ We present a new theory of
similarity between words and phrases based on information
distance and Kolmogorov complexity. To fix thoughts we
use the world-wide-web as database, and Google as search
engine. The method is also applicable to other search engines
and databases. This theory is then applied to construct
a method to automatically extract similarity, the Google
similarity distance, of words and phrases from the world-wide-web using Google page counts. The world-wide-web is the
largest database on earth, and the context information entered
by millions of independent users averages out to provide
automatic semantics of useful quality. We give applications
in hierarchical clustering, classification, and language translation. We give examples to distinguish between colors and
numbers, cluster names of paintings by 17th century Dutch
masters and names of books by English novelists, the ability to
understand emergencies, and primes, and we demonstrate the
ability to do a simple automatic English-Spanish translation.
Finally, we use the WordNet database as an objective baseline
against which to judge the performance of our method. We
conduct a massive randomized trial in binary classification
using support vector machines to learn categories based on
our Google distance, resulting in a mean agreement of 87%
with the expert crafted WordNet categories.
Index Terms— accuracy comparison with WordNet categories, automatic classification and clustering, automatic meaning discovery using Google, automatic relative semantics, automatic translation, dissimilarity semantic distance, Google search, Google distribution via page hit counts, Google code, Kolmogorov complexity, normalized compression distance ( NCD ), normalized information distance ( NID ), normalized Google distance ( NGD ), meaning of words and phrases extracted from the web, parameter-free data-mining, universal similarity metric

∗ The material of this paper was presented in part at the IEEE ITSOC Information Theory Workshop 2005 on Coding and Complexity, 29th Aug. - 1st Sept., 2005, Rotorua, New Zealand, and the IEEE Intn'l Symp. Information Theory, Seattle, Wash., USA, August 2006. Manuscript received April 12, 2005; final revision June 18, 2006. Rudi Cilibrasi was supported in part by the Netherlands BSIK/BRICKS project, and by NWO project 612.55.002. He is at the Centre for Mathematics and Computer Science (Centrum voor Wiskunde en Informatica), Amsterdam, the Netherlands. Address: CWI, Kruislaan 413, 1098 SJ Amsterdam, The Netherlands. Email: [email protected]. Paul Vitanyi's work was done in part while the author was on sabbatical leave at National ICT of Australia, Sydney Laboratory at UNSW. He is affiliated with the Centre for Mathematics and Computer Science (Centrum voor Wiskunde en Informatica) and the University of Amsterdam, both in Amsterdam, the Netherlands. Supported in part by the EU Project RESQ IST-2001-37559, the ESF QiT Programme, the EU NoE PASCAL, and the Netherlands BSIK/BRICKS project. Address: CWI, Kruislaan 413, 1098 SJ Amsterdam, The Netherlands. Email: [email protected].
I. Introduction
Objects can be given literally, like the literal four-letter
genome of a mouse, or the literal text of War and Peace
by Tolstoy. For simplicity we take it that all meaning
of the object is represented by the literal object itself.
Objects can also be given by name, like “the four-letter
genome of a mouse,” or “the text of War and Peace
by Tolstoy.” There are also objects that cannot be given
literally, but only by name, and that acquire their meaning
from their contexts in background common knowledge in
humankind, like “home” or “red.” To make computers
more intelligent one would like to represent meaning in
computer-digestible form. Long-term and labor-intensive
efforts like the Cyc project [22] and the WordNet project
[33] try to establish semantic relations between common
objects, or, more precisely, names for those objects. The
idea is to create a semantic web of such vast proportions
that rudimentary intelligence, and knowledge about the real
world, spontaneously emerge. This comes at the great cost
of designing structures capable of manipulating knowledge,
and entering high quality contents in these structures by
knowledgeable human experts. While the efforts are long-running and large scale, the overall information entered is minute compared to what is available on the world-wide-web.
The rise of the world-wide-web has enticed millions of
users to type in trillions of characters to create billions of
web pages of on average low quality contents. The sheer
mass of the information about almost every conceivable
topic makes it likely that extremes will cancel and the majority or average is meaningful in a low-quality approximate
sense. We devise a general method to tap the amorphous
low-grade knowledge available for free on the world-wide-web, typed in by local users aiming at personal gratification
of diverse objectives, and yet globally achieving what is
effectively the largest semantic electronic database in the
world. Moreover, this database is available for all by using
any search engine that can return aggregate page-count
estimates for a large range of search-queries, like Google.
Previously, we and others developed a compression-
based method to establish a universal similarity metric
among objects given as finite binary strings [2], [25], [26],
[7], [8], which was widely reported [20], [21], [13]. Such
objects can be genomes, music pieces in MIDI format,
computer programs in Ruby or C, pictures in simple bitmap
formats, or time sequences such as heart rhythm data.
This method is feature-free in the sense that it doesn’t
analyze the files looking for particular features; rather it
analyzes all features simultaneously and determines the
similarity between every pair of objects according to the
most dominant shared feature. The crucial point is that
the method analyzes the objects themselves. This precludes
comparison of abstract notions or other objects that don’t
lend themselves to direct analysis, like emotions, colors,
Socrates, Plato, Mike Bonanno and Albert Einstein. While
the previous method that compares the objects themselves is
particularly suited to obtain knowledge about the similarity
of objects themselves, irrespective of common beliefs about
such similarities, here we develop a method that uses only
the name of an object and obtains knowledge about the
similarity of objects, a quantified relative Google semantics,
by tapping available information generated by multitudes
of web users. Here we are reminded of the words of D.H.
Rumsfeld [31] “A trained ape can know an awful lot / Of
what is going on in this world / Just by punching on his
mouse / For a relatively modest cost!” In this paper, the
Google semantics of a word or phrase consists of the set
of web pages returned by the query concerned.
A. An Example: While the theory we propose is rather intricate, the resulting method is simple enough. We give an example: At the time of doing the experiment, a Google search for “horse” returned 46,700,000 hits. The number of
hits for the search term “rider” was 12,200,000. Searching
for the pages where both “horse” and “rider” occur gave
2,630,000 hits, and Google indexed 8,058,044,651 web
pages. Using these numbers in the main formula (III.3)
we derive below, with N = 8,058,044,651, this yields a
Normalized Google Distance between the terms “horse”
and “rider” as follows:
NGD(horse, rider) ≈ 0.443.
In the sequel of the paper we argue that the NGD is a
normed semantic distance between the terms in question,
usually (but not always, see below) in between 0 (identical)
and 1 (unrelated), in the cognitive space invoked by the
usage of the terms on the world-wide-web as filtered
by Google. Because of the vastness and diversity of the
web this may be taken as related to the current use of
the terms in society. We did the same calculation when
Google indexed only one-half of the current number of
pages: 4,285,199,774. It is instructive that the probabilities
of the used search terms didn’t change significantly over
this doubling of pages, with number of hits for “horse”
equal 23,700,000, for “rider” equal 6,270,000, and for
“horse, rider” equal to 1,180,000. The NGD(horse, rider)
we computed in that situation was ≈ 0.460. This is in line
with our contention that the relative frequencies of web
pages containing search terms gives objective information
about the semantic relations between the search terms. If
this is the case, then the Google probabilities of search
terms and the computed NGD ’s should stabilize (become
scale invariant) with a growing Google database.
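To make the calculation concrete, the following short Python sketch evaluates the main formula (III.3) directly from page counts. The counts are the ones quoted above; the helper function ngd and its argument names are ours for illustration and are not part of any Google interface.

    import math

    def ngd(f_x, f_y, f_xy, n):
        # f_x, f_y : number of pages containing each term
        # f_xy     : number of pages containing both terms
        # n        : the normalizing constant N (here the number of indexed pages)
        numerator = max(math.log(f_x), math.log(f_y)) - math.log(f_xy)
        denominator = math.log(n) - min(math.log(f_x), math.log(f_y))
        return numerator / denominator

    # Page counts quoted in the example above.
    print(ngd(46_700_000, 12_200_000, 2_630_000, 8_058_044_651))  # approx. 0.443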
B. Related Work: There is a great deal of work in cognitive psychology [37], linguistics, and computer science about using word (or phrase) frequencies in text corpora to develop measures for word similarity or word association, partially surveyed in [34], [36], going back to at least [35]. One of the most successful is Latent Semantic Analysis (LSA) [37], which has been applied in various forms in a great number of applications. We discuss LSA and its relation to the present approach in Appendix VII. As with LSA, many other previous approaches to extracting correlations from text documents are based on text corpora that are many orders of magnitude smaller, and that are in local storage, and on assumptions that are more refined,
than what we propose. In contrast, [11], [1] and the many
references cited there, use the web and Google counts to
identify lexico-syntactic patterns or other data. Again, the
theory, aim, feature analysis, and execution are different
from ours, and cannot meaningfully be compared. Essentially, our method below automatically extracts semantic
relations between arbitrary objects from the web in a
manner that is feature-free, up to the search-engine used,
and computationally feasible. This seems to be a new
direction altogether.
C. Outline: The main thrust is to develop a new theory of semantic distance between a pair of objects, based on (and unavoidably biased by) a background contents consisting of a database of documents. An example of the latter is the set of pages constituting the world-wide-web. Similarity relations between pairs of objects are distilled from the documents by just using the number of
documents in which the objects occur, singly and jointly
(irrespective of location or multiplicity). For us, the Google
semantics of a word or phrase consists of the set of
web pages returned by the query concerned. Note that
this can mean that terms with different meaning have the
same semantics, and that opposites like “true” and “false”
often have a similar semantics. Thus, we just discover
associations between terms, suggesting a likely relationship.
As the web grows, the Google semantics may become
less primitive. The theoretical underpinning is based on
the theory of Kolmogorov complexity [27], and expressed
as coding and compression. This allows us to express and
prove properties of absolute relations between objects that
cannot even be expressed by other approaches. The theory,
application, and the particular NGD formula to express
the bilateral semantic relations are (as far as we know) not
equivalent to any earlier theory, application, and formula
in this area. The current paper is a next step in a decade
of cumulative research in this area, of which the main
thread is [27], [2], [28], [26], [7], [8] with [25], [3] using
the related approach of [29]. We start with a technical introduction outlining some notions underpinning our
approach: Kolmogorov complexity, information distance,
and compression-based similarity metric (Section II). Then
we give a technical description of the Google distribution,
the Normalized Google Distance, and the universality of
these notions (Section III). While it may be possible in
principle that other methods can use the entire world-wide-web to determine semantic similarity between terms, we do not know of a method that both can computationally use the entire web and has
the same aims as our method. To validate our method we
therefore cannot compare its performance to other existing
methods. Ours is a new proposal for a new task. We validate
the method in the following way: by theoretical analysis,
by anecdotal evidence in a plethora of applications, and by systematic and massive comparison of accuracy in a classification application compared to the uncontroversial
body of knowledge in the WordNet database. In Section III
we give the theoretic underpinning of the method and prove
its universality. In Section IV we present a plethora of
clustering and classification experiments to validate the
universality, robustness, and accuracy of our proposal. In
Section V we test repetitive automatic performance against
uncontroversial semantic knowledge: We present the results
of a massive randomized classification trial we conducted to gauge the accuracy of our method against the expert knowledge
as implemented over the decades in the WordNet database.
The preliminary publication [9] of this work on the web
archives was widely reported and discussed, for example
[16], [17]. The actual experimental data can be downloaded
from [5]. The method is implemented as an easy-to-use
software tool available on the web [6], available to all.
D. Materials and Methods: The application of the theory we develop is a method that is justified by the vastness of the world-wide-web and the assumption that the mass of information is so diverse that the frequencies of pages returned by Google queries average the semantic information in such a way that one can distill a valid semantic distance between the query subjects. It appears to
be the only method that starts from scratch, is feature-free
in that it uses just the web and a search engine to supply
contents, and automatically generates relative semantics
between words and phrases. A possible drawback of our
method is that it relies on the accuracy of the returned
counts. As noted in [1], the returned Google counts are inaccurate, especially if one uses the Boolean OR operator between search terms, at the time of writing. The
AND operator we use is less problematic, and we do not use
the OR operator. Furthermore, Google apparently estimates
the number of hits based on samples, and the number of
indexed pages changes rapidly. To compensate for the latter
effect, we have inserted a normalizing mechanism in the
CompLearn software. Generally though, if search engines
have peculiar ways of counting number of hits, in large
part this should not matter, as long as some reasonable
conditions hold on how counts are reported. Linguists
judge the accuracy of Google counts trustworthy enough:
In [23] (see also the many references to related research)
it is shown that web searches for rare two-word phrases
correlated well with the frequency found in traditional
corpora, as well as with human judgments of whether
those phrases were natural. Thus, Google is the simplest
means to get the most information. Note, however, that a
single Google query takes a fraction of a second, and that
Google restricts every IP address to a maximum of (currently) 500 queries per day, although they are cooperative enough to extend this quota for noncommercial purposes.
The experimental evidence provided here shows that the
combination of Google and our method yields reasonable
results, gauged against common sense (‘colors’ are different
from ‘numbers’) and against the expert knowledge in the
WordNet database. A reviewer suggested downscaling our method by testing it on smaller text corpora. This does not seem useful. Clearly performance will deteriorate with decreasing database size. A thought experiment using the extreme case of a single web page consisting of a single term suffices. Practically addressing this issue is begging
the question. Instead, in Section III we theoretically analyze
the relative semantics of search terms established using all
of the web, and its universality with respect to the relative
semantics of search terms using subsets of web pages.
II. Technical Preliminaries
The basis of much of the theory explored in this paper is
Kolmogorov complexity. For an introduction and details see
the textbook [27]. Here we give some intuition and notation.
We assume a fixed reference universal programming system. Such a system may be a general computer language like LISP or Ruby, and it may also be a fixed reference
universal Turing machine in a given standard enumeration
of Turing machines. The latter choice has the advantage
of being formally simple and hence easy to theoretically
manipulate. But the choice makes no difference in principle, and the theory is invariant under changes among
the universal programming systems, provided we stick to a
particular choice. We only consider universal programming
systems such that the associated set of programs is a prefix
code—as is the case in all standard computer languages.
The Kolmogorov complexity of a string x is the length, in
bits, of the shortest computer program of the fixed reference computing system that produces x as output. The choice of computing system changes the value of K(x) by at most an additive fixed constant. Since K(x) goes to infinity with x, this additive fixed constant is an ignorable quantity if we
consider large x. One way to think about the Kolmogorov
complexity K(x) is to view it as the length, in bits, of the
ultimate compressed version from which x can be recovered
by a general decompression program. Compressing x using the compressor gzip results in a file x_g with (for files that contain redundancies) the length |x_g| < |x|. Using a better compressor bzip2 results in a file x_b with (for redundant files) usually |x_b| < |x_g|; using a still better compressor like PPMZ results in a file x_p with (for again appropriately redundant files) |x_p| < |x_b|. The Kolmogorov complexity
K(x) gives a lower bound on the ultimate value: for every
existing compressor, or compressors that are possible but
not known, we have that K(x) is less or equal to the
length of the compressed version of x. That is, K(x)
gives us the ultimate value of the length of a compressed
version of x (more precisely, from which version x can
be reconstructed by a general purpose decompresser), and
our task in designing better and better compressors is to
approach this lower bound as closely as possible.
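To see the compressor hierarchy of the previous paragraph in action, the small Python sketch below uses two compressors from the standard library (zlib and bz2, standing in for the gzip and bzip2 programs mentioned above). Each compressed length is an upper bound, up to an additive constant, on K(x); a better compressor for this kind of data gets closer to the ultimate value.

    import bz2
    import zlib

    # A highly redundant string: its Kolmogorov complexity is far below its length.
    x = b"the quick brown fox jumps over the lazy dog " * 200

    # Each compressed length upper-bounds K(x) up to an additive constant.
    print("original length :", len(x))
    print("zlib compressed :", len(zlib.compress(x, 9)))
    print("bz2 compressed  :", len(bz2.compress(x, 9)))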
A. Normalized Information Distance: In [2] we considered the following notion: given two strings x and y,
what is the length of the shortest binary program in the
reference universal computing system such that the program
computes output y from input x, and also output x from
input y. This is called the information distance and denoted
as E(x, y). It turns out that, up to a negligible logarithmic
additive term,
E(x, y) = K(x, y) − min{K(x), K(y)}.
This distance E(x, y) is actually a metric: up to close
precision we have E(x, x) = 0, E(x, y) > 0 for x ≠ y,
E(x, y) = E(y, x) and E(x, y) ≤ E(x, z)+E(z, y), for all x, y, z.
We now consider a large class of admissible distances:
all distances (not necessarily metric) that are nonnegative,
symmetric, and computable in the sense that for every such
distance D there is a prefix program that, given two strings
x and y, has binary length equal to the distance D(x, y)
between x and y. Then,
E(x, y) ≤ D(x, y) + cD,   (II.1)
where cD is a constant that depends only on D but not
on x, y, and we say that E(x, y) minorizes D(x, y) up to
an additive constant. We call the information distance E
universal for the family of computable distances, since the
former minorizes every member of the latter family up to an
additive constant. If two strings x and y are close according
to some computable distance D, then they are at least as
close according to distance E. Since every feature in which
we can compare two strings can be quantified in terms of a distance, and every distance can be viewed as expressing a quantification of how much of a particular feature the strings have in common (the feature being quantified by that
distance), the information distance determines the distance
between two strings according to the dominant feature in
which they are similar. This means that, if we consider more
than two strings, every pair may have a different dominant
feature. If small strings differ by an information distance
which is large compared to their sizes, then the strings are
very different. However, if two very large strings differ by
the same (now relatively small) information distance, then
they are very similar. Therefore, the information distance
itself is not suitable to express true similarity. For that
we must define a relative information distance: we need to normalize the information distance. Such an approach was first proposed in [25] in the context of genomics-based phylogeny, and improved in [26] to the one we use here.
The normalized information distance ( NID ) has values
between 0 and 1, and it inherits the universality of the
information distance in the sense that it minorizes, up to
a vanishing additive term, every other possible normalized
computable distance (suitably defined). In the same way as
before we can identify the computable normalized distances
with computable similarities according to some features,
and the NID discovers for every pair of strings the feature
in which they are most similar, and expresses that similarity
on a scale from 0 to 1 (0 being the same and 1 being
completely different in the sense of sharing no features).
Considering a set of strings, the feature in which two strings
are most similar may be a different one for different pairs
of strings. The NID is defined by

NID(x, y) = (K(x, y) − min(K(x), K(y))) / max(K(x), K(y)).   (II.2)
It has several wonderful properties that justify its
description as the most informative metric [26].
B. Normalized Compression Distance: The NID is uncomputable since the Kolmogorov complexity is uncomputable. But we can use real data compression programs to approximate the Kolmogorov complexities K(x), K(y), K(x, y). A compression algorithm defines a
computable function from strings to the lengths of the
compressed versions of those strings. Therefore, the number
of bits of the compressed version of a string is an upper
bound on Kolmogorov complexity of that string, up to an
additive constant depending on the compressor but not on
the string in question. Thus, if C is a compressor and we
use C(x) to denote the length of the compressed version of
a string x, then we arrive at the Normalized Compression
Distance:
NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)),   (II.3)

where for convenience we have replaced the pair (x, y) in the formula by the concatenation xy. This transition raises several tricky problems, for example how the NCD approximates the NID if C approximates K, see [8], which do not need to concern us here. Thus, the NCD is actually a family of compression functions parameterized by the given data compressor C. The NID is the limiting case, where K(x) denotes the number of bits in the shortest code for x from which x can be decompressed by a general purpose computable decompressor.
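As an illustration, here is a minimal Python sketch of the NCD with zlib standing in for the compressor C; the function name and the example strings are ours, and any real compressor (bzip2, PPMZ, ...) could be substituted for zlib.

    import zlib

    def ncd(x: bytes, y: bytes) -> float:
        # Normalized Compression Distance (II.3) with zlib as the compressor C.
        c = lambda s: len(zlib.compress(s, 9))   # C(s): length of the compressed version of s
        c_x, c_y, c_xy = c(x), c(y), c(x + y)
        return (c_xy - min(c_x, c_y)) / max(c_x, c_y)

    # Similar inputs should yield a smaller NCD than dissimilar ones.
    a = ("the quick brown fox jumps over the lazy dog " * 40).encode()
    b = ("the quick brown fox leaps over the lazy cat " * 40).encode()
    noise = bytes(range(256)) * 7
    print(ncd(a, b))       # similar strings: smaller value
    print(ncd(a, noise))   # dissimilar strings: larger value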
III. Theory of Googling for Similarity

Every text corpus or particular user combined with a frequency extractor defines its own relative frequencies of words and phrases usage. In the world-wide-web and Google setting there are millions of users and text corpora, each with its own distribution. In the sequel, we show (and prove) that the Google distribution is universal for all the individual web users' distributions. The number of web pages currently indexed by Google is approaching 10^10. Every common search term occurs in millions of web pages. This number is so vast, and the number of web authors generating web pages is so enormous (and can be assumed to be a truly representative very large sample from humankind), that the probabilities of Google search terms, conceived as the frequencies of page counts returned by Google divided by the number of pages indexed by Google, approximate the actual relative frequencies of those search terms as actually used in society. Based on this premise, the theory we develop in this paper states that the relations represented by the Normalized Google Distance (III.3) approximately capture the assumed true semantic relations governing the search terms. The NGD formula (III.3) only uses the probabilities of search terms extracted from the text corpus in question. We use the world-wide-web and Google, but the same method may be used with other text corpora like the King James version of the Bible or the Oxford English Dictionary and frequency count extractors, or the world-wide-web again and Yahoo as frequency count extractor. In these cases one obtains a text corpus and frequency extractor biased semantics of the search terms. To obtain the true relative frequencies of words and phrases in society is a major problem in applied linguistic research. This requires analyzing representative random samples of sufficient sizes. The question of how to sample randomly and representatively is a continuous source of debate. Our contention that the web is such a large and diverse text corpus, and Google such an able extractor, that the relative page counts approximate the true societal word and phrase usage, starts to be supported by current real linguistics research [38], [23].

A. The Google Distribution: Let the set of singleton Google search terms be denoted by S. In the sequel we use both singleton search terms and doubleton search terms {{x, y} : x, y ∈ S}. Let the set of web pages indexed (possibly returned) by Google be Ω. The cardinality of Ω is denoted by M = |Ω|, and at the time of this writing 8·10^9 ≤ M ≤ 9·10^9 (and presumably greater by the time of reading this). Assume that a priori all web pages are equiprobable, with the probability of being returned by Google being 1/M. A subset of Ω is called an event. Every search term x usable by Google defines a singleton Google event x ⊆ Ω of web pages that contain an occurrence of x and are returned by Google if we do a search for x. Let L : Ω → [0, 1] be the uniform mass probability function. The probability of such an event x is L(x) = |x|/M. Similarly, the doubleton Google event x ∩ y ⊆ Ω is the set of web pages returned by Google if we do a search for pages containing both search term x and search term y. The probability of this event is L(x ∩ y) = |x ∩ y|/M. We can also define the other Boolean combinations: ¬x = Ω\x and x ∪ y = ¬(¬x ∩ ¬y), each such event having a probability equal to its cardinality divided by M. If e is an event obtained from the basic events x, y, . . ., corresponding to basic search terms x, y, . . ., by finitely many applications of the Boolean operations, then the probability L(e) = |e|/M.

B. Google Semantics: Google events capture in a particular sense all background knowledge about the search terms concerned available (to Google) on the web.

The Google event x, consisting of the set of all web pages containing one or more occurrences of the search term x, thus embodies, in every possible sense, all direct context in which x occurs on the web. This constitutes the Google semantics of the term.

REMARK 3.1. It is of course possible that parts of this direct contextual material link to other web pages in which x does not occur and thereby supply additional context. In our approach this indirect context is ignored. Nonetheless, indirect context may be important and future refinements of the method may take it into account. ♦

C. The Google Code: The event x consists of all possible direct knowledge on the web regarding x. Therefore, it is natural to consider code words for those events as coding this background knowledge. However, we cannot
use the probability of the events directly to determine a
prefix code, or, rather, the underlying information content implied by the probability. The reason is that the events overlap and hence the summed probability exceeds 1. By the Kraft inequality [12] this prevents a corresponding set of code-word lengths. The solution is to normalize: We use the probability of the Google events to define a probability mass function over the set {{x, y} : x, y ∈ S} of Google search terms, both singleton and doubleton terms. There are |S| singleton terms, and (|S| choose 2) doubletons consisting of a pair of non-identical terms. Define

N = ∑_{{x,y}⊆S} |x ∩ y|,

counting each singleton set and each doubleton set (by definition unordered) once in the summation. Note that this means that for every pair {x, y} ⊆ S, with x ≠ y, the web pages z ∈ x ∩ y are counted three times: once in x = x ∩ x, once in y = y ∩ y, and once in x ∩ y. Since every web page that is indexed by Google contains at least one occurrence of a search term, we have N ≥ M. On the other hand, web pages contain on average not more than a certain constant α search terms. Therefore, N ≤ αM. Define

g(x) = g(x, x),   g(x, y) = L(x ∩ y)M/N = |x ∩ y|/N.   (III.1)

Then, ∑_{{x,y}⊆S} g(x, y) = 1. This g-distribution changes over time, and between different samplings from the distribution. But let us imagine that g holds in the sense of an instantaneous snapshot. The real situation will be an approximation of this. Given the Google machinery, these are absolute probabilities which allow us to define the associated prefix code-word lengths (information contents) for both the singletons and the doubletons. The Google code G is defined by

G(x) = G(x, x),   G(x, y) = log 1/g(x, y).   (III.2)
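For intuition, a tiny sketch of (III.1) and (III.2) in code, with made-up counts for two terms only; a real S has millions of terms, and we follow the practical choice discussed below of using the number of indexed pages for the normalizer N.

    import math

    # Hypothetical page counts |x|, |y|, |x ∩ y| and a normalizer N.
    count_x, count_y, count_xy = 46_700_000, 12_200_000, 2_630_000
    N = 8_058_044_651

    g = lambda count: count / N                # (III.1): g-probability of an event
    G = lambda count: math.log2(1 / g(count))  # (III.2): Google code length, in bits

    print(G(count_x), G(count_y), G(count_xy))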
D. The Google Similarity Distance: In contrast to
strings x where the complexity C(x) represents the length
of the compressed version of x using compressor C, for a
search term x (just the name for an object rather than the
object itself), the Google code of length G(x) represents the
shortest expected prefi x-code word length of the associated
Google event x. The expectation is taken over the Google
distribution p. In this sense we can use the Google distribution as a compressor for the Google semantics associated
with the search terms. The associated NCD , now called
the normalized Google distance ( NGD ) is then defined by
(III.3), and can be rewritten as the right-hand expression:
NGD(x, y) = (G(x, y) − min(G(x), G(y))) / max(G(x), G(y))   (III.3)
          = (max{log f(x), log f(y)} − log f(x, y)) / (log N − min{log f(x), log f(y)}),
where f (x) denotes the number of pages containing x, and
f (x, y) denotes the number of pages containing both x and
y, as reported by Google. This NGD is an approximation
to the NID of (II.2) using the prefi x code-word lengths
(Google code) generated by the Google distribution as
defining a compressor approximating the length of the
Kolmogorov code, using the background knowledge on the
web as viewed by Google as conditional information. In
practice, we use the page counts returned by Google for the
frequencies, and we have to choose N. From the right-hand
side term in (III.3) it is apparent that by increasing N we
decrease the NGD , everything gets closer together, and
by decreasing N we increase the NGD , everything gets
further apart. Our experiments suggest that every reasonable
(M or a value greater than any f (x)) value can be used
as normalizing factor N, and our results seem in general
insensitive to this choice. In our software, this parameter
N can be adjusted as appropriate, and we often use M for
N. The following are the main properties of the NGD (as
long as we choose parameter N ≥ M):
1) The range of the NGD is in between 0 and ∞
(sometimes slightly negative if the Google counts
are untrustworthy and state f (x, y) > min{ f (x), f (y)},
see Section I-D);
a) If x = y or if x ≠ y but frequency f (x) = f (y) =
f (x, y) > 0, then NGD(x, y) = 0. That is, the
semantics of x and y in the Google sense is the
same.
b) If frequency f (x) = 0, then for every search term
y we have f (x, y) = 0, and the NGD(x, y) =
log f (y)/ log(N/ f (y)).
2) The NGD is always nonnegative and NGD(x, x) = 0 for every x. For every pair x, y we have NGD(x, y) = NGD(y, x): it is symmetric. However, the NGD is not a metric: it does not satisfy NGD(x, y) > 0 for every x ≠ y. As before, let x denote the set of web pages containing one or more occurrences of x. For example, choose x ≠ y with x = y. Then, f(x) = f(y) = f(x, y) and NGD(x, y) = 0. Nor does the NGD satisfy the triangle inequality NGD(x, y) ≤ NGD(x, z) + NGD(z, y) for all x, y, z. For example, choose z = x ∪ y, x ∩ y = ∅, x = x ∩ z, y = y ∩ z, and |x| = |y| = √N. Then, f(x) = f(y) = f(x, z) = f(y, z) = √N, f(z) = 2√N, and f(x, y) = 0. This yields NGD(x, y) = 1 and NGD(x, z) = NGD(z, y) = 2/log(N/4), which violates the triangle inequality for N > 64.
3) The NGD is scale-invariant in the following sense:
If the number N of pages indexed by Google (accounting for the multiplicity of different search terms
per page) grows, the number of pages containing
a given search term goes to a fixed fraction of N,
and so does the number of pages containing a given
conjunction of search terms. This means that if N
doubles, then so do the f -frequencies. For the NGD
to give us an objective semantic relation between
search terms, it needs to become stable when the
number N grows unboundedly.
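This scale invariance is easy to check numerically: doubling N together with all page counts leaves formula (III.3) unchanged, because both the numerator and the denominator are differences of logarithms. A minimal sketch, reusing the page counts of the half-size index quoted in Section I-A:

    import math

    def ngd(f_x, f_y, f_xy, n):
        # Formula (III.3) written in terms of page counts.
        return (max(math.log(f_x), math.log(f_y)) - math.log(f_xy)) / \
               (math.log(n) - min(math.log(f_x), math.log(f_y)))

    f_x, f_y, f_xy, n = 23_700_000, 6_270_000, 1_180_000, 4_285_199_774
    print(ngd(f_x, f_y, f_xy, n))                   # approx. 0.460, as in Section I-A
    print(ngd(2 * f_x, 2 * f_y, 2 * f_xy, 2 * n))   # identical value after doubling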
E. Universality of Google Distribution: A central notion in the application of compression to learning is the notion of “universal distribution,” see [27]. Consider an effective enumeration P = p_1, p_2, . . . of probability mass functions with domain S. The list P can be finite or countably infinite.

DEFINITION 3.2. A probability mass function p_u occurring in P is universal for P, if for every p_i in P there is a constant c_i > 0 and ∑_{i≠u} c_i ≥ 1, such that for every x ∈ S we have p_u(x) ≥ c_i · p_i(x). Here c_i may depend on the indexes u, i, but not on the functional mappings of the elements of list P nor on x.

If p_u is universal for P, then it immediately follows that for every p_i in P, the prefix code-word length for source word x, see [12], associated with p_u, minorizes the prefix code-word length associated with p_i, by satisfying log 1/p_u(x) ≤ log 1/p_i(x) + log c_i, for every x ∈ S.

In the following we consider partitions of the set of web pages, each subset in the partition together with a probability mass function of search terms. For example, we may consider the list A = 1, 2, . . . , a of web authors producing pages on the web, and consider the set of web pages produced by each web author, or some other partition. “Web author” is just a metaphor we use for convenience. Let web author i of the list A produce the set of web pages Ω_i and denote M_i = |Ω_i|. We identify a web author i with the set of web pages Ω_i he produces. Since we have no knowledge of the set of web authors, we consider every possible partition of Ω into one or more equivalence classes, Ω = Ω_1 ∪ · · · ∪ Ω_a, Ω_i ∩ Ω_j = ∅ (1 ≤ i ≠ j ≤ a ≤ |Ω|), as defining a realizable set of web authors A = 1, . . . , a.

Consider a partition of Ω into Ω_1, . . . , Ω_a. A search term x usable by Google defines an event x_i ⊆ Ω_i of web pages produced by web author i that contain search term x. Similarly, x_i ∩ y_i is the set of web pages produced by i that is returned by Google searching for pages containing both search term x and search term y. Let

N_i = ∑_{{x,y}⊆S} |x_i ∩ y_i|.

Note that there is an α_i ≥ 1 such that M_i ≤ N_i ≤ α_i M_i. For every search term x ∈ S define a probability mass function g_i, the individual web author's Google distribution, on the sample space {{x, y} : x, y ∈ S} by

g_i(x) = g_i(x, x),   g_i(x, y) = |x_i ∩ y_i|/N_i.   (III.4)

Then, ∑_{{x,y}⊆S} g_i(x, y) = 1.

THEOREM 3.1. Let Ω_1, . . . , Ω_a be any partition of Ω into subsets (web authors), and let g_1, . . . , g_a be the corresponding individual Google distributions. Then the Google distribution g is universal for the enumeration g, g_1, . . . , g_a.

Proof: We can express the overall Google distribution in terms of the individual web authors' distributions:

g(x, y) = ∑_{i∈A} (N_i/N) g_i(x, y).

Consequently, g(x, y) ≥ (N_i/N) g_i(x, y). Since also g(x, y) ≥ g(x, y), we have shown that g(x, y) is universal for the family g, g_1, . . . , g_a of individual web authors' Google distributions, according to Definition 3.2.

REMARK 3.3. Let us show that, for example, the uniform distribution L(x) = 1/s (s = |S|) over the search terms x ∈ S is not universal, for s > 2. By the requirement ∑ c_i ≥ 1, the sum taken over the number a of web authors in the list A, there is an i such that c_i ≥ 1/a. Taking the uniform distribution on say s search terms assigns probability 1/s to each of them. By the definition of universality of a probability mass function for the list of individual Google probability mass functions g_i, we can choose the function g_i freely (as long as a ≥ 2, and there is another function g_j to exchange probabilities of search terms with). So choose some search term x and set g_i(x) = 1, and g_i(y) = 0 for all search terms y ≠ x. Then, we obtain g(x) = 1/s ≥ c_i g_i(x) = 1/a. This yields the required contradiction for s > a ≥ 2. ♦

F. Universality of Normalized Google Distance: Every individual web author produces both an individual Google distribution g_i, and an individual prefix code-word length G_i associated with g_i (see [12] for this code) for the search terms.

DEFINITION 3.4. The associated individual normalized Google distance NGD_i of web author i is defined according to (III.3), with G_i substituted for G.

These Google distances NGD_i can be viewed as the individual semantic distances according to the bias of web author i. These individual semantics are subsumed in the general Google semantics in the following sense: The normalized Google distance is universal for the family of individual normalized Google distances, in the sense that it is about as small as the least individual normalized Google distance, with high probability. Hence the Google semantics as evoked by all of the web society in a certain sense captures the biases or knowledge of the individual web authors. In Theorem 3.2 we show that, for every k ≥ 1, the inequality

NGD(x, y) < β NGD_i(x, y) + γ,   (III.5)

with β = (min{G_i(x), G_i(y)} − log k)/min{G_i(x), G_i(y)} and γ = (log kN/N_i)/min{G(x), G(y)}, is satisfied with probability going to 1 with growing k.

REMARK 3.5. To interpret (III.5), we observe that in case G_i(x) and G_i(y) are large with respect to log k, then β ≈ 1. If moreover log N/N_i is large with respect to log k, then γ ≈ (log N/N_i)/min{G(x), G(y)}. Let us estimate γ under reasonable assumptions. Without loss of generality assume G(x) ≤ G(y). If f(x) = |x|, the number of pages returned on query x, then G(x) = log(N/f(x)). Thus, γ ≈ (log N − log N_i)/(log N − log f(x)). The uniform expectation of N_i is N/|A|, and N divided by that expectation of N_i equals |A|, the number of web authors producing web pages. The uniform expectation of f(x) is N/|S|, and N divided by that expectation of f(x) equals |S|, the number of Google search terms we use. Thus, the more the number of search terms exceeds the number of web authors, the more γ goes to 0 in expectation. ♦

REMARK 3.6. To understand (III.5), we may consider the codelengths involved as the Google database changes over time. It is reasonable to expect that both the total number of pages as well as the total number of search terms in the Google database will continue to grow for some time. In this period, the sum total probability mass will be carved up into increasingly smaller pieces for more and more search terms. The maximum singleton and doubleton codelengths within the Google database will grow. But the universality property of the Google distribution implies that the Google distribution's code length for almost all particular search terms will be within a negligible error of the best codelength among any of the individual web authors. The size of this gap will grow more slowly than the codelength for any particular search term over time. Thus, the coding space that is suboptimal in the Google distribution's code is an ever-smaller piece (in terms of proportion) of the total coding space. ♦

THEOREM 3.2. (i) For every pair of search terms x, y, the set of web authors B for which (III.5) holds has ∑{N_i/N : i ∈ B} > (1 − 1/k)^2. That is, if we select a web page uniformly at random from the total set Ω of web pages, then we have probability > (1 − 1/k)^2 that (III.5) holds for web author i that authored the selected web page.

(ii) For every web author i ∈ A, the g_i-probability concentrated on the pairs of search terms for which (III.5) holds is at least (1 − 2/k)^2.

Proof: The prefix code-word lengths G_i associated with g_i satisfy G(x) ≤ G_i(x) + log N/N_i and G(x, y) ≤ G_i(x, y) + log N/N_i. Substituting G(x, y) by G_i(x, y) + log N/N_i in the middle term of (III.3), we obtain

NGD(x, y) ≤ (G_i(x, y) − max{G(x), G(y)} + log N/N_i) / min{G(x), G(y)}.   (III.6)

Markov's Inequality says the following: Let p be any probability mass function; let f be any nonnegative function with p-expected value E = ∑_i p(i) f(i) < ∞. For E > 0 we have ∑_i {p(i) : f(i)/E > k} < 1/k.

(i) Define a probability mass function p(i) = N_i/N for i ∈ A. For every x ∈ S, the Google probability mass value g(x) equals the expectation of the individual Google probability mass values at x, that is, ∑_{i∈A} p(i) g_i(x). We can now argue as follows: for every x as above, we can consider x fixed and use “g_i(x)” for “f(i)”, and “g(x)” for “E” in Markov's Inequality. Then, ∑_i {p(i) : g_i(x)/g(x) > k} < 1/k, and therefore ∑_i {p(i) : g_i(x)/g(x) ≥ k} > 1 − 1/k. Hence, there is an x-dependent subset of web authors B_x = {i ∈ A : g_i(x)/g(x) ≥ k}, such that ∑{p(i) : i ∈ B_x} > 1 − 1/k. By definition g_i(x) ≥ k g(x) for i ∈ B_x. Then, for i ∈ B_x, we have G_i(x) − log k ≤ G(x). Substitute G_i(x) − log k for G(x), with probability > (1 − 1/k) that G_i(x) − log k ≤ G(x) for web author i that authored a web page that is selected uniformly at random from the set Ω of all web pages. Recall that ∑{p(i) : i ∈ B_x} = ∑{N_i : i ∈ B_x}/N. Similarly this holds for search term y with respect to the equivalently defined B_y. The combination of search terms x and y therefore satisfies both G_i(x) − log k ≤ G(x) and G_i(y) − log k ≤ G(y), for i ∈ B with B = B_x ∩ B_y. Then, ∑{p(i) : i ∈ B} = ∑{N_i : i ∈ B}/N ≥ ∑{N_i : i ∈ B_x}/N × ∑{N_i : i ∈ B_y}/N > (1 − 1/k)^2. Hence, the probability that both search terms x and y satisfy G_i(x) − log k ≤ G(x) and G_i(y) − log k ≤ G(y), respectively, for web author i that authored a web page selected uniformly at random from Ω, is greater than (1 − 1/k)^2. Substitute both G(x) and G(y) according to these probabilistically satisfied inequalities in (III.6), both in the max-term in the numerator, and in the min-term in the denominator. This proves item (i).

(ii) Fix web author i ∈ A. We consider the conditional probability mass functions g'(x) = g(x|x ∈ S) and g'_i(x) = g_i(x|x ∈ S) over single search terms: The g'_i-expected value of g'(x)/g'_i(x) is

∑_x g'_i(x) (g'(x)/g'_i(x)) ≤ 1.

Then, by Markov's Inequality

∑_x {g'_i(x) : g'(x) ≤ j g'_i(x)} > 1 − 1/j.   (III.7)

Let ∑_{x∈S} g(x) = h and ∑_{x∈S} g_i(x) = h_i. Since the probability of an event of a doubleton set of search terms is not greater than that of an event based on either of the constituent search terms, 1 ≥ h, h_i ≥ 1/2. Therefore, 2g(x) ≥ g'(x) ≥ g(x) and 2g_i(x) ≥ g'_i(x) ≥ g_i(x). Then, for the search terms x satisfying (III.7), we have

∑_x {g_i(x) : g(x) ≤ 2j g_i(x)} > 1 − 1/j.

For the x's with g(x) ≤ 2j g_i(x) we have G_i(x) ≤ G(x) + log 2j. Substitute G_i(x) − log 2j for G(x) (there is g_i-probability ≥ 1 − 1/j that G_i(x) − log 2j ≤ G(x)) and G_i(y) − log 2j ≤ G(y) in (III.6), both in the max-term in the numerator, and in the min-term in the denominator. Noting that the two g_i-probabilities (1 − 1/j) are independent, the total g_i-probability that both substitutions are justified is at least (1 − 1/j)^2. Substituting k = 2j proves item (ii).

Therefore, the Google normalized distance minorizes every normalized compression distance based on a particular user's generated probabilities of search terms, with high probability up to an error term that in typical cases is ignorable.
IV. Applications and Experiments
A. Hierarchical Clustering: We used our software tool
available from http://www.complearn.org, the same tool that
has been used in our earlier papers [8], [7] to construct trees
representing hierarchical clusters of objects in an unsupervised way. However, now we use the normalized Google
distance ( NGD ) instead of the normalized compression
distance ( NCD ). The method works by first calculating a distance matrix whose entries are the pairwise NGD 's of the terms in the input list. Then it calculates a best-matching unrooted ternary tree using a novel quartet-method style heuristic based on randomized hill-climbing using a new fitness objective function for the candidate trees. Let us
briefly explain what the method does; for more explanation
see [10], [8]. Given a set of objects as points in a space
provided with a (not necessarily metric) distance measure,
the associated distance matrix has as entries the pairwise
distances between the objects. Regardless of the original
space and distance measure, it is always possible to configure n objects in n-dimensional Euclidean space in such a way that the associated distances are identical to the original
ones, resulting in an identical distance matrix. This distance
matrix contains the pairwise distance relations according
to the chosen measure in raw form. But in this format
that information is not easily usable, since for n > 3 our
cognitive capabilities rapidly fail. Just as the distance matrix
is a reduced form of information representing the original
data set, we now need to reduce the information even
further in order to achieve a cognitively acceptable format
like data clusters. To extract a hierarchy of clusters from
the distance matrix, we determine a dendrogram (ternary
tree) that agrees with the distance matrix according to a
fidelity measure. This allows us to extract more information
from the data than just flat clustering (determining disjoint
clusters in dimensional representation). This method does
not just take the strongest link in each case as the “true”
one, and ignore all others; instead the tree represents all the
relations in the distance matrix with as little distortion as
is possible. In the particular examples we give below, as in
all clustering examples we did but not depicted, the fidelity
was close to 1, meaning that the relations in the distance
matrix are faithfully represented in the tree. The objects
to be clustered are search terms consisting of the names
of colors, numbers, and some tricky words. The program
automatically organized the colors towards one side of the
tree and the numbers towards the other, Figure 1. It arranges
the terms which have as only meaning a color or a number,
and nothing else, on the farthest reach of the color side and
the number side, respectively. It puts the more general terms
black and white, and zero, one, and two, towards the center,
thus indicating their more ambiguous interpretation. Things which were not exactly colors or numbers are also put towards the center, like the word “small”. As far as
the authors know there do not exist other experiments that
create this type of semantic distance automatically from
the web using Google or similar search engines. Thus,
there is no baseline to compare against; rather the current
experiment can be a baseline to evaluate the behavior of
future systems.
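The first step of this pipeline, building the pairwise NGD matrix, is easy to sketch. In the fragment below the function page_count is a hypothetical stand-in for whatever interface returns search-engine page counts for one or two terms; the quartet-tree heuristic itself is part of the CompLearn tool and is not reproduced here.

    import itertools
    import math

    def ngd(f_x, f_y, f_xy, n):
        # Formula (III.3) from page counts; n is the normalizing constant N.
        return (max(math.log(f_x), math.log(f_y)) - math.log(f_xy)) / \
               (math.log(n) - min(math.log(f_x), math.log(f_y)))

    def ngd_matrix(terms, page_count, n):
        # Symmetric matrix of pairwise NGDs for a list of search terms.
        # page_count(*terms) must return the number of pages containing all given
        # terms; it is a placeholder for a real search-engine interface.
        size = len(terms)
        matrix = [[0.0] * size for _ in range(size)]
        for i, j in itertools.combinations(range(size), 2):
            d = ngd(page_count(terms[i]), page_count(terms[j]),
                    page_count(terms[i], terms[j]), n)
            matrix[i][j] = matrix[j][i] = d
        return matrix

The resulting matrix is then handed to the tree-construction heuristic to produce trees such as the ones in Figures 1, 2, and 3.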
B. Dutch 17th Century Painters: In the example of Figure 2, the names of fifteen paintings by Steen,
Rembrandt, and Bol were entered. We use the full name as a
single Google search term (also in the next experiment with
book titles). In the experiment, only painting title names
were used; the associated painters are given below. We do
not know of comparable experiments to use as baseline
to judge the performance; this is a new type of contents
clustering made possible by the existence of the web and
search engines. The painters and paintings used are as
follows:
Rembrandt van Rijn: Hendrickje slapend; Portrait of
Maria Trip; Portrait of Johannes Wtenbogaert ; The Stone
Bridge ; The Prophetess Anna ;
Jan Steen: Leiden Baker Arend Oostwaert ; Keyzerswaert ; Two Men Playing Backgammon ; Woman at her
Toilet ; Prince’s Day ; The Merry Family ;
Ferdinand Bol: Maria Rey ; Consul Titus Manlius
Torquatus ; Swartenhout ; Venus and Adonis .
C. English Novelists: Another example is English
novelists. The authors and texts used are:
William Shakespeare: A Midsummer Night’s Dream;
Julius Caesar; Love’s Labours Lost; Romeo and Juliet .
Jonathan Swift: The Battle of the Books; Gulliver’s
Travels; Tale of a Tub; A Modest Proposal;
Oscar Wilde: Lady Windermere’s Fan; A Woman of No Importance; Salome; The Picture of Dorian Gray.

Fig. 1. Colors and numbers arranged into a tree using NGD. Terms clustered: black, blue, chartreuse, eight, five, fortytwo, four, green, one, orange, purple, red, seven, six, small, three, transparent, two, white, yellow, zero.

Fig. 2. Hierarchical clustering of pictures (CompLearn 0.8.19, compressor: google, tree score S(T) = 0.940019).

Fig. 3. Hierarchical clustering of authors (CompLearn 0.8.19, compressor: google, tree score S(T) = 0.940416).
The clustering is given in Figure 3, and to provide
a feeling for the figures involved we give the associated NGD matrix in Figure 4. The S(T) value in Figure 3 gives the fidelity of the tree as a representation of the
pairwise distances in the NGD matrix (S(T ) = 1 is perfect
and S(T ) = 0 is as bad as possible. For details see [6],
[8]). The question arises why we should expect this. Are
names of artistic objects so distinct? (Yes. The point also
being that the distances from every single object to all
other objects are involved. The tree takes this global aspect
into account and therefore disambiguates other meanings
of the objects to retain the meaning that is relevant for this
collection.) Is the distinguishing feature subject matter or
title style? In these experiments with objects belonging to
the cultural heritage it is clearly subject matter. To stress
the point we used “Julius Caesar” of Shakespeare. This
term occurs on the web overwhelmingly in other contexts
and styles. Yet the collection of the other objects used,
and the semantic distance towards those objects, given by
the NGD formula, singled out the semantics of “Julius
Caesar” relevant to this experiment. Term co-occurrence in
this specific context of author discussion is not swamped
by other uses of this common English term because of
the particular form of the NGD and the distances being
pairwise. With book titles that are common words, like “Horse” and “Rider” by author X, supposing they exist, this swamping effect will presumably arise. Does the system get confused if we add more artists? (Representing the
NGD matrix in bifurcating trees without distortion becomes
more difficult for, say, more than 25 objects. See [8].) What
about other subjects, like music, sculpture? (Presumably,
the system will be more trustworthy if the subjects are more
common on the web.) These experiments are representative
for those we have performed with the current software. We
did not cherry-pick the best outcomes. For example, all
experiments with these three English writers, with different
selections of four works of each, always yielded a tree so
that we could draw a convex hull around the works of each
author, without overlap. Interestingly, a similar experiment
with Russian authors gave worse results. The readers can
do their own experiments to satisfy their curiosity using our
publicly available software tool at http://clo.complearn.org/,
also used in the depicted experiments.
A Woman of No Importance:    0.000 0.458 0.479 0.445 0.494 0.149 0.363 0.471 0.371 0.300 0.278 0.261
A Midsummer Night's Dream:   0.458 -0.011 0.573 0.392 0.299 0.506 0.332 0.248 0.499 0.537 0.535 0.415
A Modest Proposal:           0.479 0.563 0.002 0.323 0.507 0.575 0.607 0.502 0.605 0.335 0.359 0.463
Gulliver's Travels:          0.444 0.382 0.323 0.000 0.368 0.565 0.486 0.339 0.540 0.284 0.330 0.229
Julius Caesar:               0.494 0.301 0.506 0.368 0.000 0.612 0.313 0.210 0.373 0.492 0.533 0.447
Lady Windermere's Fan:       0.149 0.506 0.575 0.509 0.611 0.000 0.525 0.604 0.568 0.347 0.347 0.324
Love's Labours Lost:         0.362 0.340 0.607 0.485 0.313 0.524 0.000 0.351 0.553 0.514 0.462 0.513
Romeo and Juliet:            0.471 0.244 0.502 0.339 0.211 0.604 0.351 0.000 0.389 0.527 0.544 0.380
Salome:                      0.371 0.499 0.605 0.535 0.373 0.571 0.549 0.389 0.000 0.524 0.541 0.402
Tale of a Tub:               0.300 0.537 0.335 0.285 0.491 0.347 0.514 0.527 0.520 0.000 0.160 0.420
The Battle of the Books:     0.278 0.535 0.360 0.330 0.535 0.347 0.462 0.544 0.538 0.160 0.000 0.373
The Picture of Dorian Gray:  0.261 0.425 0.463 0.228 0.447 0.461 0.513 0.380 0.407 0.421 0.373 0.000

Fig. 4. Distance matrix of pairwise NGD's (columns in the same order as the rows).
Training Data
  Positive Training (22 cases): avalanche, bomb threat, broken leg, burglary, car collision, death threat, fire, flood, gas leak, heart attack, hurricane, landslide, murder, overdose, pneumonia, rape, roof collapse, sinking ship, stroke, tornado, train wreck, trapped miners.
  Negative Training (25 cases): arthritis, broken dishwasher, broken toe, cat in tree, contempt of court, dandruff, delayed train, dizziness, drunkenness, enumeration, flat tire, frog, headache, leaky faucet, littering, missing dog, paper cut, practical joke, rain, roof leak, sore throat, sunset, truancy, vagrancy, vulgarity.
  Anchors (6 dimensions): crime, happy, help, safe, urgent, wash.

Testing Results
  Positive predictions. Positive tests: assault, coma, electrocution, heat stroke, homicide, looting, meningitis, robbery, suicide. Negative tests: menopause, prank call, pregnancy, traffic jam.
  Negative predictions. Positive tests: sprained ankle. Negative tests: acne, annoying sister, campfire, desk, mayday, meal.
  Accuracy: 15/20 = 75.00%.

Fig. 5. Google-SVM learning of “emergencies.”
Each experiment can take a long time, hours, because of the Googling, network traffic, and tree reconstruction and layout. Don’t wait, just check for the result later. On the
web page http://clo.complearn.org/clo/listmonths/t.html the
ongoing cumulated results of all (in December 2005 some
160) experiments by the public, including the ones depicted
here, are recorded.
D. SVM – NGD Learning: We augment the Google
method by adding a trainable component of the learning
system. Here we use the Support Vector Machine ( SVM
) as a trainable component. For the SVM method used in
this paper, we refer to the exposition [4]. We use LIBSVM
software for all of our SVM experiments.
The setting is a binary classification problem on examples represented by search terms. We require a human expert to provide a list of at least 40 training words, consisting
of at least 20 positive examples and 20 negative examples,
to illustrate the contemplated concept class. The expert also
provides, say, six anchor words a1 , . . . , a6 , of which half are
in some way related to the concept under consideration.
Then, we use the anchor words to convert each of the 40 training words w_1, . . . , w_40 to 6-dimensional training vectors v̄_1, . . . , v̄_40.
Training Data
  Positive Training (21 cases): 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73.
  Negative Training (22 cases): 4, 6, 8, 9, 10, 12, 14, 15, 16, 18, 20, 21, 22, 24, 25, 26, 27, 28, 30, 32, 33, 34.
  Anchors (5 dimensions): composite, number, orange, prime, record.

Testing Results
  Positive predictions. Positive tests: 79, 83, 89, 91, 97, 101, 103, 107, 109. Negative tests: 110.
  Negative predictions. Positive tests: none. Negative tests: 36, 38, 40, 42, 44, 45, 46, 48, 49.
  Accuracy: 18/19 = 94.74%.

Fig. 6. Google-SVM learning of primes.
The training vectors are then used to train an SVM to learn the concept, and test words may subsequently be classified using the same anchors and the trained SVM model.
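To make the construction above concrete, the following minimal Python sketch builds the NGD feature vectors and trains the classifier. It is an illustration under stated assumptions, not the authors' implementation: hits() is a hypothetical page-count lookup for a search engine, N is an assumed total number of indexed pages, the joint count is approximated by a conjunctive query, and scikit-learn's SVC (which wraps LIBSVM) stands in for the LIBSVM command-line tools.

import math
from sklearn.svm import SVC

N = 8e9  # assumed order of magnitude of the number of indexed pages

def hits(term):
    # Hypothetical page-count lookup; replace with a real search-engine interface.
    raise NotImplementedError

def ngd(x, y):
    # NGD from the page counts f(x), f(y), f(x,y), as defined earlier in the paper.
    fx = math.log(max(hits(x), 1))
    fy = math.log(max(hits(y), 1))
    fxy = math.log(max(hits(x + " " + y), 1))  # joint occurrence, approximated by querying both terms
    return (max(fx, fy) - fxy) / (math.log(N) - min(fx, fy))

def to_vector(word, anchors):
    # 6-dimensional training vector: NGD of the word against each anchor.
    return [ngd(word, a) for a in anchors]

def train_concept(pos_words, neg_words, anchors):
    # Convert training words to NGD vectors and train an SVM on the labeled vectors.
    X = [to_vector(w, anchors) for w in pos_words + neg_words]
    y = [1] * len(pos_words) + [0] * len(neg_words)
    clf = SVC(kernel="rbf")  # LIBSVM-backed support vector machine
    clf.fit(X, y)
    return clf

def classify(clf, word, anchors):
    # Test words are classified with the same anchors and the trained model.
    return clf.predict([to_vector(word, anchors)])[0]

For instance, applying train_concept to the 22 positive and 25 negative training words of Figure 5, with the six anchors crime, happy, help, safe, urgent, wash, yields a classifier of the kind whose test results are reported there.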
In Figure 5, we trained using a list of “emergencies”
as positive examples, and a list of “almost emergencies” as
negative examples. The figure is self-explanatory. The accuracy on the test set is 75%. In Figure 6 the method learns
to distinguish prime numbers from non-prime numbers by
example. The accuracy on the test set is about 95%. This
example illustrates several common features of our method
that distinguish it from the strictly deductive techniques.
E. NGD Translation: Yet another potential application of the NGD method is in natural language translation. (In the experiment below we do not use SVMs to obtain our result, but determine correlations instead.) Suppose we are given a system that tries to infer a translation vocabulary between English and Spanish. Assume that the system has already determined that there are five words
that appear in two different matched sentences, but the
permutation associating the English and Spanish words
is, as yet, undetermined. This setting can arise in real
situations, because English and Spanish have different
rules for word-ordering. Thus, at the outset we assume
a pre-existing vocabulary of eight English words with
their matched Spanish translation. Can we infer the correct
permutation mapping the unknown words using the preexisting vocabulary as a basis? We start by forming an
NGD matrix using additional English words of which the
translation is known, Figure IV-E. We label the columns
by the translation-known English words, the rows by the
translation-unknown words. The entries of the matrix are
the NGD ’s of the English words labeling the columns
Fig. 7. Given starting vocabulary.
English      Spanish
tooth        diente
joy          alegria
tree         arbol
electricity  electricidad
table        tabla
money        dinero
sound        sonido
music        musica

Fig. 8. Unknown-permutation vocabulary.
English      Spanish
plant        bailar
car          hablar
dance        amigo
speak        coche
friend       planta
and rows. This constitutes the English basis matrix. Next,
consider the known Spanish words corresponding to the
known English words. Form a new matrix with the known
Spanish words labeling the columns in the same order as
the known English words. Label the rows of the new matrix
by choosing one of the many possible permutations of
the unknown Spanish words. For each permutation, form
the NGD matrix for the Spanish words, and compute the correlation of this sequence of values with the corresponding values in the given English basis matrix. Choose the permutation with the highest positive correlation. If there is no positive correlation, we report a failure to extend
the vocabulary. In this example, the computer inferred the
correct permutation for the testing words, see Figure 9.
Fig. 9. Predicted (optimal) permutation.
English      Spanish
plant        planta
car          coche
dance        bailar
speak        hablar
friend       amigo
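The permutation search just described can be sketched as follows. This is an illustration, not the authors' code: it assumes the ngd() function from the earlier sketch and uses NumPy for the correlation, and the function and variable names are ours.

from itertools import permutations
import numpy as np

def ngd_matrix(row_words, col_words):
    # Matrix of NGD values, rows labeled by row_words and columns by col_words.
    return np.array([[ngd(r, c) for c in col_words] for r in row_words])

def infer_permutation(unknown_english, basis_english, basis_spanish, unknown_spanish):
    # English basis matrix: translation-unknown English words against translation-known ones.
    eng = ngd_matrix(unknown_english, basis_english).ravel()
    best, best_corr = None, 0.0
    for perm in permutations(unknown_spanish):
        # Spanish matrix with columns in the same order as the known English words.
        spa = ngd_matrix(list(perm), basis_spanish).ravel()
        corr = np.corrcoef(eng, spa)[0, 1]
        if corr > best_corr:
            best, best_corr = perm, corr
    return best  # None signals a failure to extend the vocabulary

The permutation returned pairs its i-th word with the i-th translation-unknown English word; for the words of Figure 8 it would ideally reproduce the assignment of Figure 9.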
V. Systematic Comparison with WordNet Semantics
WordNet [33] is a semantic concordance of English. It
focusses on the meaning of words by dividing them into
categories. We use this as follows. A category we want
to learn, the concept, is termed, say, “electrical”, and
represents anything that may pertain to electronics. The
negative examples consist simply of everything else. This category represents a typical expansion of a
node in the WordNet hierarchy. In an experiment we ran,
the accuracy on the test set is 100%: It turns out that
“electrical terms” are unambiguous and easy to learn and
classify by our method. The information in the WordNet database has been entered over the decades by human experts and is precise. The database is an academic venture and is publicly accessible. Hence it is a good baseline against which to judge the accuracy of our method in an indirect manner. While we cannot directly compare the semantic distance, the NGD, between objects, we can indirectly judge how accurate it is by using it as a basis for a learning algorithm. In particular, we investigated how well
semantic categories as learned using the NGD – SVM
approach agree with the corresponding WordNet categories.
For details about the structure of WordNet we refer to
the official WordNet documentation available online. We
considered 100 randomly selected semantic categories from
the WordNet database. For each category we executed
the following sequence. First, the SVM is trained on
50 labeled training samples. The positive examples are
randomly drawn from the WordNet database in the category
in question. The negative examples are randomly drawn
from a dictionary. While the latter examples may be false
negatives, we consider the probability negligible. Per experiment we used a total of six anchors, three of which are
randomly drawn from the WordNet database category in
question, and three of which are drawn from the dictionary.
Subsequently, every example is converted to a 6-dimensional vector using NGD. The i-th entry of the vector is the NGD between the i-th anchor and the example concerned
(1 ≤ i ≤ 6). The SVM is trained on the resulting labeled
vectors. The kernel-width and error-cost parameters are
automatically determined using five-fold cross-validation.
Finally, testing of how well the SVM has learned the
classifier is performed using 20 new examples in a balanced
ensemble of positive and negative examples obtained in
Fig. 10. Histogram of accuracies over 100 trials of the WordNet experiment (horizontal axis: accuracy, 0.4 to 1.1; vertical axis: number of trials, 0 to 30).
the same way, and converted to 6-dimensional vectors in
the same manner, as the training examples. This results
in an accuracy score of correctly classified test examples.
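The kernel-width and error-cost selection by five-fold cross-validation mentioned above can be realized, for example, by a grid search; the following sketch uses scikit-learn (whose SVC wraps LIBSVM) with an assumed parameter grid and is not the authors' own script.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def tune_and_test(X_train, y_train, X_test, y_test):
    # Five-fold cross-validation over the RBF kernel width (gamma) and the error cost (C).
    param_grid = {
        "C": [2 ** k for k in range(-5, 16, 2)],
        "gamma": [2 ** k for k in range(-15, 4, 2)],
    }
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    search.fit(X_train, y_train)
    # Accuracy on the held-out examples (here, the 20 balanced test examples).
    return search.best_estimator_, search.score(X_test, y_test)

Here X_train and X_test are the 6-dimensional NGD vectors constructed as described, and y_train, y_test the corresponding labels.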
We ran 100 experiments. The actual data are available
at [5]. A histogram of agreement accuracies is shown in
Figure 10. On average, our method turns out to agree
well with the WordNet semantic concordance made by
human experts. The mean of the accuracies of agreements is
0.8725. The variance is ≈ 0.01367, which gives a standard
deviation of ≈ 0.1169. Thus, it is rare to find agreement less than 75%. The total number of Google searches involved in this randomized automatic trial is upper bounded by 100 × 70 × 6 × 3 = 126,000. Considerable savings resulted from the fact that we can re-use certain Google counts. For every new term, in computing its 6-dimensional vector, the NGDs with respect to the six anchors require the counts for the anchors, which need to be computed only once per experiment, the count of the new term, which needs to be computed only once per term, and the counts of the joint occurrences of the new term with each of the six anchors, which have to be computed in each case. Altogether, this gives a total of 6 + 70 + 70 × 6 = 496 searches per experiment, so 49,600 Google searches for the entire trial.
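The bookkeeping behind these figures can be spelled out as a worked equation (this breakdown is ours: the 70 terms per experiment are the 50 training plus 20 test examples, and the factor 3 counts the two single page counts and the joint count entering each NGD value):

\[ 100 \times 70 \times 6 \times 3 = 126{,}000 \quad\text{(no re-use)}, \]
\[ 100 \times (6 + 70 + 70 \times 6) = 100 \times 496 = 49{,}600 \quad\text{(re-using anchor and term counts)}. \]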
It is conceivable that other scores, used instead of the NGD in the construction of the 6-dimensional vectors, work competitively. Yet something simple like "the number of words used in common in their dictionary definitions" (Google indexes dictionaries too) begs the question and is unlikely to be successful. In [26] the NCD approach,
compression of the literal objects, was compared with a
number of alternative approaches like the Euclidean distance between frequency vectors of blocks. The alternatives
gave results that were completely unacceptable. In the current setting, we can conceive of Euclidean vectors of word
frequencies in the set of pages corresponding to the search
term. Apart from the fact that Google does not support automatic analysis of all pages reported for a search term, it would be computationally infeasible to analyze the millions of pages involved. Thus, a competitive nontrivial alternative against which to compare the present technique is an
interesting open question.
VI. Conclusion
A comparison can be made with the Cyc project [22]. Cyc,
a project of the commercial venture Cycorp, tries to create
artificial common sense. Cyc's knowledge base consists of
hundreds of microtheories and hundreds of thousands of
terms, as well as over a million hand-crafted assertions
written in a formal language called CycL [30]. CycL is
an enhanced variety of first-order predicate logic. This
knowledge base was created over the course of decades
by paid human experts. It is therefore of extremely high
quality. Google, on the other hand, is almost completely
unstructured, and offers only a primitive query capability
that is not nearly flexible enough to represent formal
deduction. But what it lacks in expressiveness Google
makes up for in size; Google has already indexed more
than eight billion pages and shows no signs of slowing
down.
Acknowledgment
We thank the referees and others for comments on presentation.
VII. Appendix: Relation to LSA
The basic assumption of Latent Semantic Analysis is that
“the cognitive similarity between any two words is reflected
in the way they co-occur in small subsamples of the language.” In particular, this is implemented by constructing a
matrix with rows labeled by the d documents involved, and
the columns labeled by the a attributes (words, phrases).
The entries are the number of times the column attribute
occurs in the row document. The entries are then processed
by taking the logarithm of the entry and dividing it by the
number of documents the attribute occurred in, or some
other normalizing function. This results in a sparse but
high-dimensional matrix A. A main feature of LSA is to
reduce the dimensionality of the matrix by projecting it into
an adequate subspace of lower dimension using singular
value decomposition A = U D V^T, where U, V are orthogonal matrices and D is a diagonal matrix. The diagonal elements λ_1, ..., λ_p (p = min{d, a}) satisfy λ_1 ≥ ··· ≥ λ_p, and the closest matrix A_k of rank k < rank(A) in terms of the so-called Frobenius norm is obtained by setting λ_i = 0 for i > k. Using A_k corresponds to using the most important
dimensions. Each attribute is now taken to correspond to a column vector in A_k, and the similarity between two attributes is usually taken to be the cosine between their two vectors. To compare LSA to our proposed method, the documents could be the web pages, and the entries of the matrix A the frequencies of the search terms in each web page. This is then processed as above to obtain a vector for each search term. Subsequently, the cosine between
vectors gives the similarity between the terms. LSA has
been used in a plethora of applications, ranging from database query systems to synonym-answering systems for TOEFL tests. Comparing its performance to our method
is problematic for several reasons. First, the numerical quantities measuring the semantic distance between pairs of terms cannot be compared directly, since they have quite different epistemologies. An indirect comparison could be made by using each method as the basis for a particular application and comparing accuracies. However, application of LSA in
terms of the web using Google is computationally out of
the question, because the matrix A would have 10^10 rows, even if Google were to report frequencies of occurrences in web pages and identify the web pages properly. One would
need to retrieve the entire Google database, which is many
terabytes. Moreover, as noted in Section I-D, each Google
search takes a significant amount of time, and we cannot
automatically make more than a certain number of them
per day. An alternative interpretation, considering the web as a single document, makes the matrix A above into a vector and appears to defeat the LSA process altogether.
Summarizing, the basic idea of our method is similar to
that of LSA in spirit. What is novel is that we can do it
with selected terms over a very large document collection,
whereas LSA involves matrix operations over a closed collection of limited size, and hence cannot be applied in the web context.
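For concreteness, a minimal sketch of the LSA pipeline described in this appendix, using NumPy on a small document-attribute count matrix; the log/document-frequency weighting chosen here is only one of the many possible normalizations mentioned above, and none of this is part of the NGD method itself.

import numpy as np

def lsa_similarity(counts, k):
    # counts: d x a matrix of attribute (word/phrase) counts per document.
    # Normalize: log of the entry, divided by the number of documents containing the attribute.
    doc_freq = np.count_nonzero(counts, axis=0) + 1
    A = np.log1p(counts) / doc_freq
    # Singular value decomposition A = U D V^T; keep only the k largest singular values.
    U, D, Vt = np.linalg.svd(A, full_matrices=False)
    Ak = U[:, :k] @ np.diag(D[:k]) @ Vt[:k, :]
    # Each attribute corresponds to a column of Ak; similarity is the cosine between columns.
    cols = Ak / (np.linalg.norm(Ak, axis=0, keepdims=True) + 1e-12)
    return cols.T @ cols  # a x a matrix of pairwise attribute similarities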
References
[1] J.P. Bagrow, D. ben-Avraham, On the Google-fame of scientists and other populations, AIP Conference Proceedings
779:1(2005), 81–89.
[2] C.H. Bennett, P. Gács, M. Li, P.M.B. Vitányi, W. Zurek,
Information Distance, IEEE Trans. Information Theory,
44:4(1998), 1407–1423.
[3] C.H. Bennett, M. Li, B. Ma, Chain letters and evolutionary
histories, Scientific American, June 2003, 76–81.
[4] C.J.C. Burges. A tutorial on support vector machines for
pattern recognition, Data Mining and Knowledge Discovery,
2:2(1998),121–167.
[5] Automatic Meaning Discovery Using Google: 100
Experiments in Learning WordNet Categories, 2004,
http://www.cwi.nl/∼cilibrar/googlepaper/appendix.pdf
[6] R. Cilibrasi, Complearn Home, http://www.complearn.org/
[7] R. Cilibrasi, R. de Wolf, P. Vitanyi. Algorithmic clustering
of music based on string compression, Computer Music J.,
28:4(2004), 49-67.
[8] R. Cilibrasi, P. Vitanyi. Clustering by compression, IEEE
Trans. Information Theory, 51:4(2005), 1523–1545.
[9] R. Cilibrasi, P. Vitanyi, Automatic meaning discovery using
Google, http://xxx.lanl.gov/abs/cs.CL/0412098 (2004).
[10] R. Cilibrasi, P. Vitanyi, A New Quartet Tree Heuristic for Hierarchical Clustering, http://www.cwi.nl/∼paulv/papers/quartet.pdf
[11] P. Cimiano, S. Staab, Learning by Googling, SIGKDD Explorations, 6:2(2004), 24–33.
[12] T.M. Cover and J.A. Thomas, Elements of Information Theory, Wiley, New York, 1991.
[13] J.-P. Delahaye, Classer musiques, langues, images, textes et genomes, Pour La Science, 317(March 2004), 98–103.
[14] The basics of Google search, http://www.google.com/help/basics.html
[15] L.G. Kraft, A device for quantizing, grouping and coding amplitude modulated pulses, Master's thesis, Dept. of Electrical Engineering, M.I.T., Cambridge, Mass., 1949.
[16] D. Graham-Rowe, A search for meaning, New Scientist, 29 January 2005, p. 21.
[17] Slashdot, January 29, 2005: http://science.slashdot.org/article.pl?sid=05/01/29/1815242tid=217tid=14
[18] Chih-Chung Chang and Chih-Jen Lin, LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/∼cjlin/libsvm
[19] P. Cimiano, S. Staab, Learning by googling, ACM SIGKDD Explorations Newsletter, 6:2(December 2004), 24–33.
[20] H. Muir, Software to unzip identity of unknown composers, New Scientist, 12 April 2003.
[21] K. Patch, Software sorts tunes, Technology Research News, April 23/30, 2003.
[22] D.B. Lenat, Cyc: A large-scale investment in knowledge infrastructure, Comm. ACM, 38:11(1995), 33–38.
[23] F. Keller, M. Lapata, Using the web to obtain frequencies for unseen bigrams, Computational Linguistics, 29:3(2003), 459–484.
[24] A.N. Kolmogorov, Three approaches to the quantitative definition of information, Problems Inform. Transmission, 1:1(1965), 1–7.
[25] M. Li, J.H. Badger, X. Chen, S. Kwong, P. Kearney, and H. Zhang, An information-based sequence distance and its application to whole mitochondrial genome phylogeny, Bioinformatics, 17:2(2001), 149–154.
[26] M. Li, X. Chen, X. Li, B. Ma, P. Vitanyi, The similarity metric, IEEE Trans. Information Theory, 50:12(2004), 3250–3264.
[27] M. Li, P.M.B. Vitanyi, An Introduction to Kolmogorov Complexity and Its Applications, 2nd Ed., Springer-Verlag, New York, 1997.
[28] M. Li and P.M.B. Vitányi, Algorithmic Complexity, pp. 376–382 in: International Encyclopedia of the Social & Behavioral Sciences, N.J. Smelser and P.B. Baltes, Eds., Pergamon, Oxford, 2001/2002.
[29] M. Li and P.M.B. Vitányi, Reversibility and adiabatic computation: trading time and space for energy, Proc. Royal Society of London, Series A, 452(1996), 769–789.
[30] S.L. Reed, D.B. Lenat, Mapping ontologies into Cyc, Proc. AAAI Conference 2002 Workshop on Ontologies for the Semantic Web, Edmonton, Canada. http://citeseer.nj.nec.com/509238.html
[31] D.H. Rumsfeld, The digital revolution, originally published June 9, 2001, following a European trip. In: H. Seely, The Poetry of D.H. Rumsfeld, 2003, http://slate.msn.com/id/2081042/
[32] C.E. Shannon, A mathematical theory of communication, Bell Systems Technical J., 27(1948), 379–423 and 623–656.
[33] G.A. Miller et al., WordNet, A Lexical Database for the English Language, Cognitive Science Lab, Princeton University, http://www.cogsci.princeton.edu/∼wn
[34] E. Terra and C. L. A. Clarke. Frequency Estimates for
Statistical Word Similarity Measures. HLT/NAACL 2003,
Edmonton, Alberta, May 2003.
[35] M.E. Lesk, Word-word associations in document retrieval
systems, American Documentation, 20:1(1969), 27–38.
[36] P.-N. Tan, V. Kumar, J. Srivastava, Selecting the right
interestingness measure for associating patterns. Proc. ACMSIGKDD Conf. Knowledge Discovery and Data Mining,
2002, 491–502.
[37] T. Landauer and S. Dumais, A solution to Plato’s problem:
The latent semantic analysis theory of acquisition, induction
and representation of knowledge, Psychol. Rev., 104(1997),
211–240.
[38] Corpus colossal: How well does the world wide web represent human language? The Economist, January 20, 2005.
http://www.economist.com/science
/displayStory.cfm?story id=3576374
VIII. Biographies of the Authors
RUDI CILIBRASI received his B.S. with honors from
the California Institute of Technology in 1996. He has
programmed computers for over two decades, both in academia and in industry with various companies in Silicon Valley, including Microsoft, in diverse areas such
as machine learning, data compression, process control,
VLSI design, computer graphics, computer security, and
networking protocols. He is now a PhD student at the
Centre for Mathematics and Computer Science (CWI) in
the Netherlands, and expects to receive his PhD soon
on the circle of ideas of which this paper is representative. He helped create the fi rst publicly downloadable
Normalized Compression/Google Distance software, and
is maintaining http://www.complearn.org now. Home page:
http://www.cwi.nl/∼cilibrar/
PAUL M.B. VITÁNYI is a Fellow of the Centre for
Mathematics and Computer Science (CWI) in Amsterdam
and is Professor of Computer Science at the University
of Amsterdam. He serves on the editorial boards of Distributed Computing (until 2003), Information Processing
Letters, Theory of Computing Systems, Parallel Processing
Letters, International Journal of Foundations of Computer
Science, Journal of Computer and Systems Sciences (guest
editor), and elsewhere. He has worked on cellular automata, computational complexity, distributed and parallel
computing, machine learning and prediction, physics of
computation, Kolmogorov complexity, and quantum computing.
Together with Ming Li he pioneered applications of
Kolmogorov complexity and co-authored “An Introduction
to Kolmogorov Complexity and its Applications,” SpringerVerlag, New York, 1993 (2nd Edition 1997), parts of which
have been translated into Chinese, Russian and Japanese.
Home page: http://www.cwi.nl/∼paulv/