S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R.harshman

Download as pdf or txt
Download as pdf or txt
You are on page 1of 10

Inferring Web Communities from Link Topology

David Gibson Jon Kleinberg Prabhakar Raghavan


Dept. of Computer Science Dept. of Computer Science Almaden Resexch Center
UC Berkeley Cornell llniversity IBM
Herkeley, CA 94720 1JSA Ithaca, NY 14553 USA San Jose, CA 95120 IJSA
$1 510 643 5425 $1607 254 4636 $1408 927 1804
dag&x.berkeley.edu kleinberQcs.cornell.edu pragh@?almaden.ibm.com

ABSTRACT linked content,. Thus, while individuals can impose or-


The World Wide Web grows through a decentralized, der at an extremely local level, its global organization is
almost anarchic process, and this has resulted in a large utterly “unplanned” - in some sense, high-level struc-
hyperlinked corpus without the kind of logical organiza- ture can emerge only through a posteriori analysis.
tion that can be built into more tradit,ionally-created hy-
permedia. To extract, meaningful structure under such The link structure of the www represents a consider-
circumstances, we develop a notion of hyperlinked com- able amount of latent human annotation, and thus offers
munities on the www t,hrough an analysis of the link a promising starting point for structural studies of the
topology. By invoking a simple, mathematically clean Web. There has been a growing amount of work directed
method for defining and exposing the structure of these at the integrat,ion of textual content and link informa-
communities, we are able to derive a number of themes: tion for the purpose of organizing [2, 4, 13, 241, visualiz-
The communities can be viewed as containing a core of ing [7, 211 and searching [l, 5, 6, 14, 17, 19, 22, 23, 25,261
central, “authoritative” pages linked tog&her by “hub in hypermedia such as the WWW. The present work
pages” ; and they exhibit a natural type of hierarchical originat,es from the problem of searching on the WWW;
topic generalization that can be inferred directly from building on this, it attempts to deal explicitly with defm-
the pat,t,ern of linkage. Our investigation shows that al- ing a meaningful notion of structure in such an environ-
though the process by which users of the Web create ment, as a way of addressing issues such as navigation
pages and links is very difficult to understand at a “lo- and information discovery.
cal” level, it results in a much greater degree of orderly
high-level structure than has typically been assumed. Our emphasis here is on an invest,igation of the link
topology of the WWW, and some fairly pervasive themes
Keywords: Hypertext, communities, information explo-
we have identified about the structure of hypertextual
ration, World Wide Web, collaborative annotation.
communities that, have developed in the Web. We will
INTRODUCTION see that, this notion of a community provides a surpris-
As hyperlinked environments grow in size and complex- ingly clear perspective from which to view the seemingly
ity, discovering and representing meaningful high-level haphazard development of the Web’s infrastructure.
structure in them becomes an increasingly challenging
problem. This is particularly evident in the case of t)he The themes that. emerge are valuablt, in a number of rc-
World Wide Web (WWW), and the search for structure spects. Our analysis of the link structure of t,he www
here is compelling for several reasons. The www is a suggests that the on-going process of page creation and
hypertext corpus of enormous complexity, and it COII- linkage, while very difficult to understand at a local
tinues to expand at a phenomenal rate. Moreover, it’ level, results in structure that is considerably more or-
can be viewed as an intricate form of “populist hyper- derly than is typically assumed. Thus, it gives us a
media,” in which millions of on-line participants, many global understanding of the ways in which independent
wit,11 conflicting goals, are continuously creating hyper- users build connections to one another in hypermedia
that arises in a distributed fashion, and it provides a
Pe~idon to make d&d or h.wdcopiesof all or pb ofhis work for basis for predicting the way in which on-line communi-
Personalor ckiWoom useiSpanted without fee providedhat copies
are not madeor distributedfor profit or commercialadvantageand that ties in less comput,er-oriented disciplines will develop as
coW bear this notice and the full citation on the first page,~~ copy they become increasingly “wired.” It also suggests some
ofimk to republish,to post on serversor to redistributeto lists, of the types of structured, higher-level information that
rewires prior specificpermissionand/ora fee.
HmerText 98 Pittsburgh PA USA designc>rs of information discovery tools may be able to
CoPyright ACM 1998 o-89791-972--61981 G...$~.oo provide bot,h for, and about., user populations on the
Web.

225
Our study is based on experience with a hyperlink-oriented important point: our use of the term cornrnvn~dy does
method for searching introduced by Kleinberg in [17], not imply that these structures have been construct,ed
and with HITS (Hyperlink-Induced Topic Search), an ex- in a centralized or planned fashion. Rather, our exper-
pr,rimental system built around this technique. The uii- imenta,tion wit,h HITS shows that such communit,ies of
dr,rlying technique is discussed in detail in [l’i]. Here we hubs and authorities ure a recurrang consequence of thr
invoke this technique, developing it for our study of com- way in which creators of pages on the www lmk to 0116
munities on t,he Web. We begin with a brief summary nnother m the contczt of topics of widespread intcrcst.
(in the following section) of the main concepts from [I71
OVERVIEW OF HITS
that are necessary for understanding our study. Follow-
We begin wit,h a summary of the main concept’s from
ing this, t,he bulk of the paper is then a discussion of the
[17] t,hat, are necessary for underst,anding the presc,nt#
methodology and basic motivation underlying our in-
work.
vestigation of www communities, and the main themes
that have emerged.

HITS is concerned wit.h the identification of a&haritatane


hypermedia sources for broad-topic information discov-
er’y. We briefly illustrate these notions with a set of
examples. Consider, on the one hand, an individual
who wishes to use the Web to find the phone number
of a friend who is a student at Harvard Law School.
The desired information is likely to be contained on
only one or two pages, and hence the main issue is
to locate this small number of re1evan.t pages. On the
other hand, consider an individual who wishes to use
the Web to find information about Harvard University.
Since there are more than 800,000 pages on the www
containing the term “Harvard,” a shortage of relevant
1. St#arting from a user-supplied query, HITS assembles a
pages is no longer t,he problem. Rather, for a broad-topic
root sel S of pages: typically, up to 200 pages returned
search such as this, the user requires a way of identify-
by a search engine such as AltaVista [9] on that query. It
ing a small collection of the most central, or auth,ori-
then expands this to a larger base set T by adding in any
tatzve, pages on the topic “Harvard.” Most standard
pages that point to, or are pointed to by, any page in 5’.
search engines do not, for example, return authoritative
(To prevent, the size of T from exploding, only a fixed-
pages such as www . harvard. edu. The int,eresting ques-
size subset of the pages pointing to, or pointed to by, any
tion arising in this context, is: how can one determine,
single page in S is considered.) See the accompanying
without human intervrntion, that www . harvard. edu is
figure, in which we depict the relationship of the root)
indeed a page that should be considered authoratulive
SPI, S and the base set, ‘1’.
for the topic “Harvard”?
2. We associate with each page p a hub roeight h(p) and
The technique underlying HITS stems from t,wo premises an authority ,weight a(p), all initialized to 1. Let p -+ q
[17]: first. t)hat the implicit annotation provided by hu- den&r “page p has a hypctrlink to page q”. HI’I’S then
man creators of’ hyperlinks cont,ains sufficient informa- iterat,ively updates the h’s and CZ’Sas f’ollows:
tion to infer a notion of “authority”; and second, build-
ing on t,his, that sufficiently
bedded communities
broad topics contain em-
of hyperlinked pages. We view
U(P)
:=c h(Y),
4-V
such communities as containing two distinct, but inter-
relat#ed, types of pages: uuthorzties (highly-referenced h(P)
:=c 4Y).
P-q
pages) on the t.opic, as well as numerous pages t,hat
“point” to mauy of the authorities, and thus serve to Thus, a single iterat,ion replaces c&(y) by the swn of the
“pull” t,hcm t,oget,her. We refer t,o pages of the latter 110’s of pages pointing to p, and then replaces h(p) by
type as “hubs,” since they serve as st,rong central points t#hr, sum of the no’s of pages pointed to by p.
from which authority is “conferred” on relevant pages.
Hubs and authorities exhibit what could be called a 3. The updating operations are performed for all t)he pages,
mut’ually reinforcing relationship: a good hub point,s to aud t,ht, process is rep’eat,ed (normalizing the weights af-
many good authorities; a good uclthority is pointed to by ter each it.eration). It can he proved that this iterative
many good hubs. We break this apparent, circularity by proct’ss couverges t,o stable sets of authoritjy and hub
au it,erxtivr method described in t,he next, s&ion. An weights [l7]. We declare t,he 10 pages wil,h i,he highest,

226
a() values together with the 10 pages wit’h the highest ingful communities as well [17]. For example, HITS also
Ir() values to be the core of a colnmunity. (The number identifies, in the base set for the topic “Harvard”, a large
10 here is more or less arbitrary, and is not crucial for community of pages on bio-medical topics, drawn into
the following discussion; essentially, we wish our corn- the base set because of the strong linkage between these
munities t,o have a size that is “manageable” for human pages and the many biological and medical labs associ-
users.) ated with Harvard. This is a case in which the main
community can be considered “on-topic,” with several
In fact) it can brs proved t)hat the equilibrium values of the smaller “off-topic” communities arising as well.
hub weights and authority weights correspond to coor-
It turns out that the matrices MhUb and fifaUth discussed
tlinatjes in the principal eigenvectors of a pair of matrices
above contain enough information to allow for the dis-
Ahuh and MaLltb, derived from the link structure [17].
covery of such additional communities. Recall that the
Vote that, the met,hod is ext,remely simple a.nd mathe- basic algorit#hm for identifying a community could be
mat#ically clean: one can analyze its convergence prop- analyzed in terms of the principal eigenvectors of Mtl&
erties in a rigorous fashion, and the only tunable pa- and MaUth. It turns out that the first few non-principal
rameter is the procedure for fixing the root set. We feel eigenvectors of bft,,,b and M&h have the same intu-
t,hat this makes the t,echnique an a.ppealing framework itive meaning as the principal eigenvector: they rep-
in which to search for inherent st,ructure in Web com- resent, pairs of weight assignments exhibiting the mutu-
munities. The fact, that the method is designed to run ally reinforcing relationship of hubs and authorities [17].
on an arbitrary link st,ructure, wit#hout fine-tuning or Thus, by computing several of the non-principal eigen-
t,hc incorporation of expert knowledge about the WWW, vectors of MhUb and MaUth, HITS can discover additional
suggests t,hat the structural observations that emerge communities within the sarne base set of linked pages.
are largely intrinsic to the Web itself, rather than an We will use the term principal community to refer to
artifact of an “over-t,rained” algorithm. the community found by the basic algorithm, associated
With the principal eigenvectors of MhUb and Mauth; it is
We return to our initia.1 example, the topic “Harvard”, intuitively the community exhibiting the densest pat-
i,o illust,rate these notions in a concrete setting. The top tern of linkage between hubs and authorities. The addi-
aut.horities for t,his topic, as generated by HITS are the tional communities associated with non-principal eigen-
following. (‘Thr- root set, consisted of 200 pages returned vectors will be called non-principal communities. Since
by hltavista ou t#hr query “harvard”.) the non-principal eigenvectors of MhUb and Mauth can
be naturally ordered (by the magnitude of their associ-
www.harvard.edu ated eigenvalues), this induces a natural ordering on the
www.hbs.harvard.edu non-principal communities as well.
www.law.harvard.edu
ksgweb, harvard . edu Multiple communities can also form in the base set be-
cause a query t,erm has several meanings in different
The top hubs in this case consist of pages created by contexts; for example, the topic “geometry” produces
various individuals not necessarily located at Harvard, communities on computational geometry, differential ge-
consisl.ing of links t,o a large number of the t,op author- ometry, and seismic geometry.
i&s. Many other examples of a similar nature, for a
RELATED WORK
range of topics, can be found in [17].
The use of eigenvect,ors for the purposes of partitioning
Note the crucial fact that the textual conten of the a graph was introduced by Donath and Hoffman in [lo]
pages involved is only considered in the initial step, and has been studied extensively since. The method un-
when a root sel is assembled from a search engine. Fol- derlying HITS is technically distinct from that of [lo] (it
lowing this, the algorithm simply propagates weight over does not partition the “Web graph”), though the heuris-
links wii,hout further regard to the relevance of the pages tic intuition underlying both is clearly quite similar. For
it is working with. The fact that ~II’IX can reliably iden- non-hyperlinked corpora, an information retrieval tech-
tify pages t,tiat are not, only n?Liho~~tative but, also rele- nique known as latent semantic indexing [8] makes use
vant t,o t,tie user’s initial query implies something about of the singular vectors of a matrix derived from the in-
t,he breadth of the topic: since the initial root. set, was verted index of t,he corpus. HITS, on the ot,her hand, is
sufficirntjly rich in relevaut, pages; Ill? densest, COII~I~III- operat,ing purely on the link structure of a hyperlinked
nity of hubs and authorities in the surrounding base set corpus, and ma.kcs no use of matrices with term weights.
was relevant as well.
The use of link information t,o improve search perfor-
In general, of course, the base set, cont.ains not, only this mance on the www has been advanced in previous work;
densest comniunit,y but a large number of other mean- hyperlink analysis has been used for enhancing relevance

227
,juclg~ncr~t,s[I, 12, 14. 19, 261, as well as for “ranking” unity. Thus the top authorities for the topic “Harvard”
www pages [5, 22, 23, 171. Link structures have been after a single iteration consist of a mixture of pages for
studied in hypertext research that predates the WWW; in schools at Harvard, pages on bio-medical topics, and t,he
particular, Hot,afogo et al. b1] m t ro( I uce graph-theoretic home pages of YAHOO and Microsoft. As we increase the
measures based on link density and node-to-node dis- number of iterations, we are able to watch the base set
tances for clustering and searching in hypermedia. Their “resolve” itself int,o coherent communities of hubs and
notions of zndex and reference nodes bear a relation to aut,liorities.
the notions of hubs and authorit#ies used here; however,
Issues (1) and (2) are discussed in detail below. To
they are based purely on the out,- and in-degrees of indi-
study issue (3) quantitatively, we define the principal
vidual documentjs in the hyperlinked environment. (See
[I71 for a discussion of some of the difficulties in apply- community, as before, to be the set C of the top 10 au-
thorities and top 10 hubs .-- 20 pages in all. Recall tha.t
ing pure degree-counting methods to a domain on the
this set C is implicitly a function of one parameter R,
scale of the WWW.)
the root set size. However to study the “convergence”
The field of bibliometrics studies the patterns of cita- to C, we include N, the number of iterations, as sec-
tion - an implicit t,ype of “linkage” - among scient,ific ond parameter; so we say that C(R, N) is the principal
papers. See [27] f or a review. A number of their mea- communit,y obtained by running HITS for N iterations
sures have meaning in the context of hypermedia; some starting from a root set size of R. In our experience
of t,hese connections are studied in [18]. One can also with several hundred HITS trials on a variety of topics,
interpret the behavior of HITS as relying on a type of t,he communities that form predictably become stable
communzty nremory, as st,udied by Marshall et al. [20]. with a root set size of 200 and after 50 iterations; so
In essence, III’TS is searching for a particular kind of link we define C” = C(200,50). Now, for various values of
st,ructure reinforcing hubs and authorities - and this R and N, we can look at the overlap between C(R, N)
structure is crratcd by communities of people collating and G* ~~~the number of pages they have in common.
useful informat,ion, whether for their own benefit or for In Figures l-6, on the final page, we plot, these over-
ot,hers’. laps for N = 1,3,10,50 and R = 25,50,100,200, for six
representative topics in our study:
METHODOLOGY AND OBSERVATIONS
Harvard
We now discuss our use of HITS to study the emergence
cryptography
of communities on the Web; we also provide a high-level
English literature
summary of our rnain observations. In the next section
skiing
we discuss t,hese observations in greater detail.
optimization
We focus ofi t,hree issues: operations research
(1) What are t,hf, (principal and non-principal) commu-
nities discovered by MIX+? (Each curve in these plots measures overlap wit#h respect
(2) How do the communities discovered depend on the to root set size, for a fixed value of N. Again, N = 50
choice of root, sr%? We report. results on two kinds of essentially represents convergence in our case.)
variations of the root, SC%: (i) assembling t,he root set
Note t,hat our represent-ative topics have different lev-
frolrl different. sources (Alt,aVist.a vs InfoSeek), or from
els of connection to the broad area of computer science,
a query issued in different languages (“astrophysics”
ranging from topics that are heavily computer-orient.ed
vs “astrophysique”); (ii) varying the size of the root
(“cryptography”) to entirely different academic disci-
set: we consider root, sets composed of the top 25, 50,
plines (“English literature”). Such topics also ex-
100 and 200 results returned by AltaVist,a. This has the
hibit a range of different patterns and quantity of link-
effect, of “focusing” t,lic text query to varying extents;
age, and this has an effect on the rate at which the prin-
we can then analyze the communities that result.
cipal community “crystallizes,” in terms of the number
(3) How quickly do the communities crystallize as the
of iterations and t,he root set size. These contrasts will
number of it.erations grows? (Recall that ea.ch it,erat,ion
be discussed further below. The plots in In Figures 1-6
updates the hub and aut,hority weights in terms of one
also illustrate clearly the danger of using only a single
another.) Are most, communities tight,ly-knit, (emerging
iterat,ion.
after only a small number of iterations), or does it take
many iterat,ions for them to take shape? For instance We now sumrnarize some principal themes that emerge
after a single iteration, the t,op aut,horitirs are sirnply from our analysis. We find that communities on suffi-
the pages in t,lie base set wit,h the largest numbr~r of cient,ly broad topics tend to have a fairly “robust” struc-
incoming links. While t,hese pagrs are indeed highly ref- ture: the groupings of pages discovered tend to be rel-
c>rcnced (by definition), they tend tjo tack any thematic atively independent, of t,he exact choice of root set. We

228
also find that t,he success of HITS depends on both the these experiments, regardless of how the root set is con-
hreudfh of a t,opic and the discipline of human knowl- st,ructed. However, since communities have more or less
cadge under which it falls. This is because t,he density representation in different base sets, the identity of the
and conlpreheItsiv~‘nPss o f hyperlinking is far greater in prznczpal community is not always the same. But what
some disciplines (e.g., computer science) than in others t.his suggests is that multiple experiments with different
(e.g., subjects in the academic humanities, which are root, sets are providing us with multiple, slightly altered
moving onto the Web more slowly.) On topics which do views of a small set of underlying communities, which
not have a sufficient density of hyperlinking, HITS tends are being sampled by the various choices of root sets.
to find communities that generalize the initial topic in
To illustrate this point, consider the way in which the
one of several senses; this will be another focus of our
principal authorities for the topic “astrophysics” recur
discussion below.
in non-principal communities for the French and Ger-
Note that a fairly counter-intuitive point1 has been emerg- man versions of the topic. The top 5 authorit,ies for
ing from the development, above. Specifically: The great- “astrophysics” are
est degree of orderly structurq as extracted by HITS, rs
found zn communities for zuhzch the number of relevant fits.cv.nrao.edu/www/astronomy.html
pages, and the denszty of hyperlinking, is the largest. WC cdsweb.u-strasbg.fr/Simbad.html
have seen this phenomenon with “cryptography” and www.aas.org/
“English literature”. This is in contrast to the stan- heasarc.gsfc.nasa.gov
dard point of view that, the www is becoming increas- adsabs.harvard.edu/abstract-service.html
ingly “chaotic” and difficult to model; it suggests that
t,he technique underlying HITS is actually becoming more
For “astrophysique”, the top 5 authorities in the sth
effective as the size of the Web continues to increase.
non-principal community are
A consequence of this is that we can use our experi-
ence wit,h highly-linked topics (e.g. “cryptography”) to
make predictive statements about the struct#ure of com- cdsweb.u-strasbg.fr/CDS.html
munities that have yet, to fully “catch up” in the setting adswww.harvard.edu/
of t#he WWW. cdsweb.u-strasbg.fr/Simbad.html
adsabs.harvard.edu/abstract-service.html
The power of HITS on highly-linked communities is akin fits.cv.nrao.edu/www/astronomy.html
to a “law of large numbers”: many independent, largely
random annot,ations will reinforce one another and meta- For “astrophysik”, the top 5 authorities in the 7th non-
rnorphose into a broad-topic community replete with principal community are
structure. We note that the emergence of regular struc-
ture in random networks is an active topic of research
adswww.harvard.edu/
in combinatorics (see e.g. [3]), and we feel that, it would
cdsweb.u-strasbg.fr/Simbad.html
be interc,sting to investigate such connections further in
adsabs.harvard.edu/abstract-service.html
the context of BITS arid the WWW.
aibn55.astro.uni-bonn.de:8000/
THE STRUCTURE OF COMMUNITIES www.univ-rennesi.fr/ASTRO/astro.english.html
We now detail our main findings on the st,ructure of
communities. Topic Generalization. There is no sharp boundary be-
tween those topics that are “broad” and those that aren’t;
Robustness. For broad topics, HITS produces stable, but one of the primary themes that emerges from our
robust communities despite starting from a very small experience is that HITS tends to “generalize” topics that
sample of relevant, pages in the initial root set. We have are not sufficiently broad. By this we mean that the
explored t,his by several direct methods, providing HITS principal community of hubs and authorities will be rel-
with a variet#y of different root, sets relevant to the same evant, to a topic which includes, but is larger than, the
tropic. For exaruplr, we issue the same query st,ring to initial t,opic provided to HITS.
multiple search engines (e.g. AltaVista [Y], Infoseek [lo;],
and Excite [ll]); this typically produces root setIs with To make this concrete, we focus on three basic examples.
very lit,tle intersection. Similarly, one can obtain root consider first the notion of topics defined by proper
sets that are nearly disjoint by issuing a query term names. The t,opic “Michael Jordan” (the basketball
in several different1 languages (e.g., “astrophysics” vs star) turns out) to be sufficiently broad: there are ntl-
“astrophysik” vs “astrophysique”). merous hub pages containing links to pages on Michael
Jordan and his team, and hence the principal commu-
We find that, the main communities tend to recur in all nity is highly relevant. On the other hand, the topic

229
“Dennis Ritchie” (an author of the C programming www.crs4.it/HTML/Literature.html
language) produces the following top 3 authorities: humanities.uchicago.edu/ARTFL/ARTFL.html
un2sgi.unige.ch/www/athena/html/francaut.html
www.cm.cf.ac.uk/Dave/C/CE.html
www.cyberdiem.com/vin/leam.html At, the highest level, an examination of the two link
www.lysator.liu.se/c/index.html structures indicates that the density of linkage is much
greater for pages on the former topic than for those on
All of these are highly-referenced pages on the C pro- the latter, and this leads to their different behavior. We
gramming language itself. Thus, while the principal shall have more to say about this contrast below.
community is relevant to a generalzzatzon of the topic
Generalization is an int,eresting feature of HITS, since
“Dennis Ritchie,” the individual himself has been swal-
lowed up by the subject to which he most prominently it allows for the automatic characterization of certain
belongs. specific topics in terms of their generalizations. For ex-
ample, it was purely through an analysis of the link
One sees a sense in which the mechanism of HITS pushes structure that HITS placed the topic “Dennis Flitchie”
specific topics “upwards” in an implicit topic hierarchy, in a community on the C programming language.
while sufficiently broad t)opics represent fixed points on
which HITS finds relevant authorities that are not overly Convergent Generalization and a “Tree of Topics.” The
general. Consider, for example, the topic “optimization” implicit topic hierarchy just discussed can be explored
The top three authorities are the home pages of the In- more fully through further experimentation with HITS.
stitute for Operations Research and the Management It is natural to picture the process of abstraction and
Sciences, the CMU Graduate School of Industrial Ad- generalization as occurring on a tree of topics: the most
ministration, and the Society for Industrial and Applied general topics (e.g. Science, Art, Recreation) are closest
Mathemat,ics. to the root, and their descendants represent sub-topics.
Such an idea has been realized on the WWW, by human
www.informs.org ontologists, through the construction of searchable hier-
mat.gsia.cmu.edu archies such as YAHOO. A searchable hierarchy includes
hand-annotat,ion of the various topics (e.g. Science and
www.siam.org
Art); however, even through a purely automated anal-
ysis of the links, we can gain information about such
An examination of these and other authorities might
trees of topics.
suggest that the community is in fact relevant to a larger
topic that includes much of the field of optimization The hypothesis of an underlying tree-based explanation
~-- namely, operations research. This possibility is sup- for generalization is supported by the following phe-
ported by finding the principal community for the t#opic nomenon of convergent generalizataon, which HITS ex-
“operations research”; the pages discovered have sig- hibits. We have found that, given a broad topic on
nificant overlap with those for “optimization”. The which the technique does not generalize, one often dis-
top three authorities are covers very similar communities by applying HITS to a
range of more specific sub-topics.
www.informs.org
mat.gsia.cmu.edu One clear example is provided by the topic “cryptograph y”.
www.gams.com This topic is sufficiently general that HITS reliably finds
very focused authorities and hub pages. Now, suppose
Finally, it is interesting that topics which would nat- we choose more specific sub-topics of cryptography -~
urally be considered “parallel” can behave differently specifically, names of individual cryptographers. One,
with respect t,o this type of generalization. This can be finds that most of these sub-topics are generalized in
due to differences in the amount of hyperlinking among very similar ways, to communities that have significant
the relevant pages. For example, the top authorities for overlap both with each other and with the larger topic
“English literature” remain focused: “cryptography”.

Intuitively, we are looking at, a portion of the t,opic tree


the-tech.mit.edu/Shakespeare/works.html
in the vicinity of the node for “cryptography”; and we
www.english.upenn.edu/-jlynch/Lit
find that many of t,he child nodes generalize upward to
humanitas.ucsb.edu/shuttle/eng-mod.html
this common parent.

The top authorities for “German literature”, on the Other Factors Affecting Generalization. In addition to
ot,her hand, are relevant to a range of t.opics in European the natural notion of generalization discussed above, ~(1
lit,erature more generally. have ident,ified a number of ot,her factors t,hat cause HITS

230
to converge to communities that are not completely fo- pdg.lbl.gov/
cused on the initial topic. These factors recur in a num-
ber of different contexts, and again seem to highlight One can witness this phenomenon at work in t,he con-
some fundamental themes about, the structure of hyper- trast between “English literature” and
linked communities on the Web, and the underlying in- “German literature”. In particular, the greater link
terests of user populations in t,his domain. In particular, density of the former topic can be partially explained
we will discuss the following two factors: “Web-centric” by a number of influences, including the increased rep-
sub-topics, and commercialization. resentation of North American university home pages,
The first of these is the notion that, ultimately, what which tend t,o be highly linked, and the fact that the
determines the “generality” of a topic in this setting is creation of Shakespeare repositories is a much more de-
its representation on the WWW. Thus, certain topics veloped enterprise on the www than is the comparable
can seem artificially broad, and others artificially nar- activity for German authors. In the other direction, we
row, because they are of more or less interest to creators find it quite interesting that a large community associ-
of Web pages. This provides another perspective from ated with the output of “German literature” was in
which to view the notion that HITS tends to be most fact focused around Scandinavian literature; this relates
effective within topics that have the most “wired” com- closely to the numerous anecdotal observations about
munities. the unusually high density of linking among Web pages
from Finland, Sweden, and Norway.
The simplest manifestation of this principle is the fol-
lowing: in a large fraction of cases, the topic that Web One way to view this general issue (though it is an over-
pages are most concerned with involves the Web itself; simplification): at the root of our conceptual tree of
and this can influence, more or less subtly, the structure topics sits the World Wide Web itself.
of the communities that HITS discovers. In particular, The other main influence we touch on here is that of
the principal community can at first appear to be a spe- commercial/advertising influences on the link structure.
cialization, rather than a generalization, of the initial The process of page creation on the www has many
topic; but in reality, HITS has focused on this commu- simultaneous participants, and some of these can bring
nity because it represents a more “Web-centric” version many resources to bear on engineering the link struct,ure
of the topic. in a way that favors them. Thus, for topics with both
We consider three examples. For the topic “linguistics” commercial and individual involvement, the authorities
the top authorities are in the principal community will overwhelmingly tend to
be highly-commercialized pages. A simple example is
the topic “tennis”:
www.cs.columbia.edu/-acl/home.html
www.cs.columbia.edu/-radev/cgi-bin/universe.cgi
www.ling.rochester.edu/linglinks.html www.tennisserver.com/
www.uspta.org/
espnet.sportszone.com/ten/
The first two of these are strong authorities for the field
of computational linguistics. While this is a sub-topic
of the initial query, HITS has converged to it because However, we stress that it is considerably harder to mis-
of the same influences that typically cause it to gen- lead HITS by “spamming” than to mislead a t,erm-based
eralize: in the setting of the www it is a topic with search engine: HITS measures the collect8ive authority
a considerably greater density of linkage than classical vested in a resource by many people; in contrast, it is
linguistics. Next, we consider the topic “Harvard” as of common to for creators of popular pages to try “boost-
January 1997. Included among the top authorities are ing” their ranking on term-based engines by including
home pages for the Harvard Conference on the Inter- large numbers of judiciously-chosen keywords. (As a
net and Society, a very large Internet conference held in pragmatic matter, one way to limit the self-conferral of
May 1996. From the perspective of pages on the WWW, authority is to limit - or even ignore - the propagation
this is a prominent topic closely related to t,he initial of weights along links into a page from pages in its own
query, “Harvard”. Finally, an observation which should internet sub-domain.)
be obvious in hindsight: The top authority for the topic
A phenomenon which combines both t,hese influences,
“physics” is CERN, the location where the Web was
and which shows up quite frequently, is t,he recurring
in a sense “born”:
presence of pages such as AltaVista. YAHOO, and
www.microsoft . corn within a community of hubs and
www.cern.ch/ aut,horit,ies, regardless of t,he topic. The point is that
aps . erg/ these pages are linked to so heavily, from essentlially all

231
port,ions of the WWW, that they frequently show up in ing from a root set of pages in the a-step vicinity of
lists of authorit,ies. (Thus, for example, YAHOO was the www. nytimes . corn, the principal authorities consist of a
ninth-ranked authority for the topic “physics”.) We mixture of on-line news organizations and popular In-
ment,ion this issue because it tells us something about ternet sites. The first two non-principal communities
the extent to which such pages have infiltrated hyper- separate this mixture very sharply:
linked communities in all disciplines; as a pragmatic is-
sue, they could be removed from the output of HITS Authorities in first non-pr incipal commun ity :
by essentially treating them as the WWw-equivalent of www.microsoft.com/
“stop words” in information ret,rieval.
www.ibm.com/
www.apple.com/
Temporal Issues. Comparing the principal authorities
www.hp.com/
for “Harvard”, August 1997, and “Harvard”, January
www.sun.com/
1997, brings up an additional theme concerning hyper-
linked communities. There can often be substantial
Authorities in second non-principal community:
short-term factors that temporarily influence the set of
www.nytimes.com/
principal authorities for a topic; in the above exam-
ple, the Harvard Conference on the Internet and Society www.usatoday.com/
www.cnn.com/
was still a prominent feature of the topic “Harvard” in
www.sjmercury.com/
January 1997, but not in August 1997. Another exam-
www.chicago.tribune.com/
ple is the topic “comets”; in May 1997 HITS discovered
a large community concerned with alien visits and the
“Heaven’s Gate” group. FURTHER DIRECTIONS
The dynamic growth of the Web immediately suggests
Short-term influences die out, as pages and links are re-
that the type of analysis we have performed should be
moved over time; though they can be artificially kept repeated over time and the results compared. Such an
“current” by their inclusion in the indices of search en- approach provides a way of studying the temporal evolu-
gines. An interesting method for obtaining the long- tion of communities on the Web, and of understanding
term “core” of a topic is to superimpose the results of how the techniques considered here will adapt as the
HITS on the same topic, spaced out over several-month
Web continues to grow in both size and complexity.
periods.
We are currently investigating ways of using HITS to
Dissecting the Influences on a Topic. Earlier, we indi- improve performance on information retrieval tasks by
cated that the technique underlying HITS, through the combining text and the structure of hyperlinks; for work
use of eigenvectors, is able to discover multiple com- in this direction, see [6]. One goal here is to create a
munities associated with a given topic: a single prin- client-side information discovery system whose search
cipal community, together with an arbitrary number of parameters - at both the text-based and link-based
sparser non-przncipal communities. We find that typi- levels - are fully “tunable” by individual users,
cally the influences leading to topic generalization, in the
forms discussed above, a.re concentrated in some but not CONCLUSION
all of the communities discovered by HITS; and that some The www has grown into a hypertext environment of
of the non-principal communities are more “purely” fo- enormous complexity; and the process underlying its
cused on the search topic. By examining all of the growth has been driven in a chaotic fashion by the indi-
strongest communities together, one assembles a par- vidual actions of numerous participants. Our experience
t,ially nested, partially overlapping arrangement of the with HITS suggests, however, that in many respects the
different hyperlinked sub-communities that make up the end product is not as chaotic as one might suppose: the
search topic. For example, in the topic “physics”, one aggregate behavior of user populations on the www can
finds strong communities whose top authorities are com- be studied through a mathematically clean technique for
posed e&rely of (i) academic departments, (ii) high- analyzing the Web’s link topology, and one can use this
energy accelerators and colliders, and (iii) professional technique to identify themes about hyperlinked commu-
soci&ies. For the topic “Harvard” (Aug. 1997), the nities that appear to span a wide range of interests and
group of bio-medical pages can likewise be easily iden- disciplines.
tified as a coherent non-principal community, separate
from the home pages of schools at Harvard. ACKNOWLEDGEMENTS
We have benefited from discussions with many people on
This type of dissection can also be an effective way to the t,opics covered here. In particular, we thank R.akesh
separate out t,he “Web-centric” influences on a topic. Agrawal, Soumen Chakrabarti, Byron Dom, Nimrod Megidd
An example from [17] s h ows this very cleanly: start- Christos Papadimitriou, Sridhar R.ajagopalan, Ted Selker

232
and Eli Upfal. The work of the first and second authors Proc. 8th, ACiM Conference on Hypertext, 67-74,
was performed in part while visiting the IBM Almaden 1997.
Research Center. The second author is supported in 15. G. Golub, C.F. Van Loan, Matrix Computations,
part by an Alfred P. Sloan Research Fellowship and by Johns Hopkins University Press, 1989.
NSF Faculty Early Career Development Award CCR- 16. Infoseek Corporation, Infoseek search engine,
9701399. www. infoseek. corn.
REFERENCES 17 J. Kleinberg, “Authoritative sources in a hyper-
1. G.O. Arocena, A.O. Mendelzon, G.A. Mihaila, linked environment,” Proc. ACM-SIAM Sympo-
“Applications of a Web query language,” Proc. 6th sium on Discrete Algorithms, 1998. Also appears as
international World Wide Web Conference, 1997. IBM Research Report RJ 10076(91892) May 1997,
and atwww.cs.cornell.edu/home/kleinber/.
2. M. Q Wang Baldonado, T. Winograd, “Sense-
Maker: An information-exploration interface sup- 18 R. Larson, “Bibliometrics of the World Wide Web:
porting the contextual evaluation of a user’s inter- An exploratory analysis of the intellectual structure
ests,” Proc. ACM SIGCHI Conference on Human of cyberspace,” Ann. Meeting of the American Sot.
Factors in Computing, 1997. Info. Sci., 1996.
3. l3. Bollobbs, Random Graphs, Academic Press, 19. M. Marchiori, “The quest for correct information
1985. on the Web: Hyper search engines,” Proc. 6th In-
4. R. Botafogo, E. Rivlin, B. Shneiderman, “Struc- ternational World Wide Web Conference, 1997.
tural analysis of hypertext: Identifying hierar- 20. C. C. Marshall, F. M. Shipman III, R. J. McCall,
chies and useful metrics,” ACM Trans. Inf. Sys., “Putting Digital Libraries to Work: Issues from Ex-
10(1992), pp. 142-180. perience with Community Memories”, Proc. First
Annual Conference on the Theory and Practice of
5. d. Carrikre, R. Kazman, “WebQuery: Searching
and visualizing the Web through connectivity,” Digital Libraries, 1994.
Proc. 6th Internatzonal World Wide Web Confer- 21. S. Mukherjea and Y. Hara. Focus+Context Views
ence, 1997. of World-Wide Web Nodes. Proc. 8th ACM Con-
6. S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, ference on Hypertext, 187-196, 1997.
P. Raghavan, S. Rajagopalan. “Automatic resource 22. L. Page, “PageRank: Bringing order to the Web,”
list compilation by analyzing hyperlink structure Stanford Digital Libraries working paper 1997-
and associated text.” Proc. 7th International World 0072.
Wide Web Conference, 1998. 23 L. Page, S. Brin, R. Motwani, T. Winograd, “The
7. C. Chen. Structuring and visualizing the WWW PageRank citation ranking: Bringing order to the
by gencralised similarity analysis. Proc. 8th ACM Web,” submitted for publication.
Conference on Hypertext, 177-186, 1997. 24 P. Pirolli, J. Pitkow, R. Rao, “Silk from a sow’s
8. S. Deerwester, S. Dumais, T. Landauer, G. Furnas, ear: Extracting usable structures from the Web,”
R. Harshman, “Indexing by latent semantic analy- Proc. ACM SIGCHI Conference on Human Factors
sis,” J. American Sot. Info. Sci., 41(1990). in Computing, 1996.
9. Digital Equiprnent Corporation, Alta Vista search 25 E. Spertus, “Parasite: Mining structural informa-
engine, altavista.digital. corn/. tion on the Web,” Proc. 6th International World
10. W.E. Donath, A.J. Hoffman, “Algorithms for par- Wide Web Conference, 1997.
titioning of graphs and computer logic based on 26 R. Weiss, B. Velez, M. Sheldon, C. Nemprempre, P.
eigenvectors of connections matrices,” IBM Tech- Szilagyi, D.K. Gifford, “HyPursuit: A Hierarchical
nical Dzsclosure Bulletin, 15( 1972). Network Search Engine that Exploits Content-Link
11. Excite Inc., Exczte, www. excite. corn.
Hypertext Clustering,” Proceedings of the Seventh
ACM Conference on Hypertext, 1996.
12. M.E. Frisse, “Searching for information in a hyper-
text medical handbook,” Communicatrons of the 27. K.D. White, K.W. McCain, “Bibliometrics,” in
ACM, :31(7), pp. 880-886. Ann. Rev. Info. Sci. and Technology, Elsevier.
1989. pp. 119-186.
13. R. Furuta. F. M. Shipman III, C!. C. Marshall,
D. Brenner and II-W. Hsieh. Hypertext paths and 28. Yahoo! Corp. Yahoo!, www . yahoo . corn.
the world-wide web: experiences with Walden’s
paths. Proc. 8th ACM Conference 071 Hypertext,
lG7- 176, 1997.
14. G. Golovchinsky. What, the query t.old the link: the
integrat)ion of Hypertext and Information Retrieval.

233
5 I I
2s so 100 200 25 SO loo 20
Koot set Sl7x Root set sue

Figure 1. “Harvard” Figure 2. “Cryptography”

I I
I 1termLm - I iteration -
tx 3 iterations --- 3 iterations ---
: IO tterations -- IO iterations
2 50 tterations .-..-.. SO iterattons ......
20 - . .. . .
8 20 - ,..;-:;:;,-
= .. ..j..
2 IS -
$5 IS -
B IO -
;:
Y
0

5 - I I t t
25 SO 100 200 2.5 SO 100 200
Root set size Root set sue

Figure 3. “English Literature” Figure 4. “Skiing”

1 1

I tteration -
3 lterntlcms ---
IO iterations
SO iterations ..‘....
20 -
..;/
IS -

I I
2s SO 100 200 2s SO 100 200
Root set sue Root set size

Figure 5. “Optimization” Figure 6. “Operations Research”

234

You might also like