02 - Podnar-Beyond Term Indexing - A P2P Framework..

Download as pdf or txt
Download as pdf or txt
You are on page 1of 10

Informatica 30 (2006) 153161 153

Beyond Term Indexing: A P2P Framework for Web Information Retrieval


Ivana Podnar, Martin Rajman, Toan Luu, Fabius Klemm and Karl Aberer
School of Computer and Communication Sciences
Ecole Polytechnique Fdrale de Lausanne (EPFL)
CH-1015 Lausanne, Switzerland
Keywords: information retrieval, peer-to-peer overlay networks
Received: December 05, 2005
Web search over peer-to-peer (P2P) networks shows promise to become an alternative to the state-of-the-art
search engines since P2P overlays offer means for decentralized search across widely-distributed document
collections. However, the design of effective techniques for P2P indexing and retrieval raises a number of
technical challenges due to potentially unscalable resource (e.g. bandwidth, storage) consumption.
The paper presents a framework for full-text information retrieval in structured P2P networks and intro-
duces a novel retrieval model based on highly discriminative keysterms and term sets appearing in a
restricted number of documentsthat ensure efcient and scalable retrieval. Our goal is to design scalable
techniques for building a global key index in structured P2P overlays for large document collections. We
present experimental results that show acceptable indexing and retrieval costs while the retrieval quality is
comparable to standard centralized solutions with BM25 relevance computation scheme.
Povzetek: Razvito je P2P ogrodje za internetne iskalnike.
1 Introduction
Web search over P2P overlay networks has the potential to
become an alternative to current Web search engines due
to its decentralized nature and favorable scalability proper-
ties. Contemporary Web search engines are in essence cen-
tralized and require a central coordination service, which,
even when replicated, has been identied as a major system
bottleneck [5]. P2P overlays are self-organizing networks
for resource sharing that require no central coordination.
Moreover, they require minimal infrastructure and mainte-
nance, and enable higher diversity in contents and search
methods, which makes P2P Web search appealing from a
business perspective [4].
However, there is an ongoing debate on the feasibility of
P2P Web search for scalability reasons. In [9] it is shown
that the nave use of P2P networks is practically infeasi-
ble for Web search, since the generated trafc exceeds the
available capacity of the Internet. Thus different schemes
have been devised to make P2P Web search feasible. Sev-
eral approaches target at a term-to-peer indexing strategy,
where the unit of indexing are peers rather than individ-
ual documents [7, 3]. Hierarchical solutions for resource
discovery have also been proposed where a backbone P2P
network maintains a directory service which routes queries
to peers with relevant resources, i.e. documents [2, 11].
At this point little evidence is available whether these ap-
proaches can scale to very-large scale P2P Web search en-
gines.
In this paper we introduce a general framework for infor-
mation retrieval (IR) over structured P2P networks which
keeps indexing at document granularity while ensuring
scalable retrieval costs. Our assumption is that peers are
cooperative and provide documents for indexing that will
become searchable through a global index. At the same
time, they offer computing and storage resources to build
and maintain the global index and the underlying P2P net-
work. Peers choose independently the indexing features
and related posting lists for their local document collec-
tion and, for this task, combine their local knowledge and
the global knowledge maintained by the P2P overlay. A
peer basically advertises its local document collection and
collaborates with other peers by taking the responsibility
for a part of the global indexing space. Note that a sin-
gle peer cannot impose the rules used at a global level, but
a malicious peer might impede the search performance by
providing ctive descriptions of its local collection.
The framework assumes an underlying distributed hash
table (DHT) which maintains a global key-to-document in-
dex. The approach is characterized by the following central
idea: We carefully select the indexing keys so that they con-
sist of terms and termsets that are rare and thus discrimina-
tive with respect to a global document collection. As such,
they appear in a limited number of documents, thus limit-
ing the size of the posting lists associated with the keys in
the global inverted index. This directly addresses the main
problem identied in [9] for structured P2P Web search,
namely the processing and transmission of extremely large
posting lists. The achieved key-based indexing procedure
results in an extremely efcient retrieval because the post-
ing lists associated with the keys in the global P2P index are
short and correspond to precomputed and therefore readily
available results for potential multiple term queries.
Many possible techniques can be considered for creating
154 Informatica 30 (2006) 153161 I. Podnar et al.
a rare key vocabulary, but the major issue to cope with is the
fact that such a vocabulary can easily become unmanage-
able in size because it theoretically grows with 2
|V |
, where
V is the single-term vocabulary. Therefore, we present a
particular instantiation of our key-indexing which creates
keys by combining terms appearing in well-dened con-
texts in a limited number of documents. We hereafter refer
to such keys as highly discriminative keys (HDKs). Our
experiments show the following: The growth of the HDK
vocabulary per peer and the size of the global index per peer
are logarithmic when increasing the global document col-
lection by adding new peers that contribute with their local
documents to the global collection. More importantly, the
average posting list size remains constant when increasing
the global document collection size, which bounds the gen-
erated trafc during retrieval. In addition, the observed re-
trieval performance is comparable to the one achieved with
a single-termcentralized model using BM25 (Okapi) as rel-
evance computation scheme. Our indexing scheme there-
fore represents a contribution toward scalable P2P Web
search engines that opens the opportunity to index and
search virtually unlimited document collections, well be-
yond the capacity of todays best centralized solutions.
The rest of the paper is structured as follows: Section 2
gives an overview of existing strategies for P2P informa-
tion retrieval and introduces our concept of indexing with
HDKs. Section 3 presents the P2P framework for build-
ing a global full-text index in structured P2P overlays with
scalable retrieval costs and provide details about the HDK-
based retrieval model. In Section 4 we analyze its perfor-
mance through experimental evaluations using our truly-
distributed P2P-IR prototype. The related work in the area
of P2P information retrieval is revisited in Section 5, and
we conclude the paper in Section 6.
2 Web information retrieval over
P2P overlays
There are two main strategies for designing P2P search
engines in the area of IR: The rst strategy uses unstruc-
tured/hierarchical P2P networks where peers maintain lo-
cal indexes of their document collections, and the second
strategy builds a global index in structured P2P networks.
In a P2P network with N peers and a global document
collection D, the rst strategy divides D over the peer net-
work, and each peer P
i
maintains the index of its local doc-
ument collection D
i
. Such indexes are in principle inde-
pendent, and a nave solution broadcasts a query to all the
peers generating an unscalable number of messages. To
limit the query trafc, more advanced approaches use e.g.
random walks [12] or answer the query at two levels, the
peer and document level. At the rst level a group of peers
with relevant document collections is located, while at the
second level the query is submitted to relevant peers that
produce answers by querying their local indexes. The an-
swers are subsequently merged to produce a single ranked
answer set. A number of different approaches for the initial
peer selection process have been proposed in the context of
unstructured [7], hierarchical [2, 10], and structured P2P
overlays [3].
The second strategy splits the global document index
over a structured P2P network that maintains a global DHT
mapping of each indexing feature to the related posting
list [15, 18]. Structured P2P overlays offer efcient data
inserts and searches [1]: insert(key, data) routes the data
to the peer responsible for the given key, and search(key)
retrieves the data for the given key. Such networks typically
guarantee the routing latency of O(logN) number of hops,
while the routing information maintained by each peer is
bounded by O(logN).
term 1 posting list 1
term 2 posting list 2
term M-1 posting list M-1
term M posting list M
... ...
long posting lists
s
m
a
l
l

v
o
c
.
key 11 posting list 11
key 12 posting list 12
key 1i posting list 1i
... ...
short posting lists
l
a
r
g
e

v
o
c
.
PEER 1
...
key N1 posting list N1
key N2 posting list N2
key Nj posting list Nj
... ...
PEER N
PEER 1
PEER N
...
indexing with rare keys
nave approach
Figure 1: The basic idea of indexing using rare keys
The nave approach splits the global index over a peer
network as depicted in the upper part of Figure 1 and makes
each peer responsible for a disjoint set of indexing terms
from the global vocabulary. While solving the issue of stor-
ing very large indexes, such an approach is practically in-
feasible and unscalable because of extremely large posting
lists for frequent terms. The key-based indexing approach
directly addresses this issue by limiting the size of the post-
ing lists as shown in the lower part of Figure 1. The key
index only contains single terms and term sets that are rare
and thus discriminative with respect to a document collec-
tion. Notice that the extension of the index entries to rare
terms and term sets, while limiting the size of the posting
lists, entails the expansion of the key vocabulary. How-
ever, the distributed nature of the global index can provide
a virtually unlimited storage space if the number of peers
is large enough, but only in case the growth of the key vo-
cabulary size and the corresponding index size is scalable.
We dene a key k from the set of keys K as a set of
terms {t
1
, t
2
, . . . , t
s
}. These terms are elements of the doc-
ument collection indexing vocabulary V appearing in a sin-
gle document d D. The number of terms comprising a
key is bounded, i.e. 1 s s
max
.
BEYOND TERM INDEXING: A P2P FRAMEWORK FOR. . . Informatica 30 (2006) 153161 155
The quality of a key k for a given document d with re-
spect to indexing adequacy is determined by its discrim-
inative power. To be discriminative, a key k must be as
specic as possible with respect to the document d it is as-
sociated with and the corresponding document collection
D [17]. In this perspective, we categorize a key on the
basis of its global document frequency (DF), and dene a
threshold DF
max
to divide the set of keys K into two dis-
joint classes, rare and frequent keys. If a key k appears in
more than DF
max
documents, i.e. DF(k) > DF
max
, the
key is frequent, and has low discriminative power. Other-
wise, k is rare and specic with respect to the documents it
is associated with in the document collection.
The rareness property is in line with the capacity con-
straints of P2P networks that are capable of storing and
transmitting posting lists of limited size. However, the set
of rare keys, although bounded for a limited document col-
lection size, might be extremely large. It is therefore im-
portant to select among the potential key candidates those
that have favorable properties for the retrieval performance
of the P2P search engine. Additionally, it is important to
understand that keys can be interpreted as discriminative
user queries that can be answered very efciently because
the global index already stores precomputed answers asso-
ciated with such queries.
3 P2P framework for information
retrieval with HDKs
In this section we present a general framework for full-text
retrieval in a P2P environment using key-based indexing.
The framework is designed for building a global key-based
index in structured P2P networks and takes into account ca-
pacity constraints of P2P networks, such as available band-
width and storage. Resource management is performed at
two levels, locally as a peer builds an index of its local
document collection to be subsequently inserted into the
global index, and globally as the peer maintains a part of
the global index. Note that a single peer can simultane-
ously perform tasks at both levels.
3.1 Indexing
Indexing can be divided into the following three steps:
1. local selection of the vocabulary,
2. local selection of postings, and
3. global selection of postings.
The two initial steps represent local efforts of a single
peer to choose suitable indexing features for its local doc-
ument collection since each peer is competing with others
for a part of the global index space. The third step is per-
formed by all the peers maintaining the global index.
As the example in Figure 2 shows, P
i
builds the key vo-
cabulary K
i
for its local document collection D
i
. In the
next step it chooses a set of postings associated with each
key according to the global resource constraints it is aware
of, and subsequently inserts the (key, posting list) pairs
into the P2P overlay. In the given example, P
j
is respon-
sible for three such pairs from P
i
. Concurrently, two other
peers P
k
and P
l
have built a vocabulary for their document
collections, and have started inserting (key, posting list)
pairs into the P2P overlay. In the third step P
j
chooses
postings to be stored in the global index, e.g., it builds a
posting list for k
x
using the postings received from P
i
, P
k
and P
l
.
We will now describe in more details each of the three
indexing steps.
3.1.1 Vocabulary selection
A peer chooses locally rare keys because it assumes they
are good indexing candidates that are potentially globally
rare. A peer can also put frequent keys into the key vo-
cabulary if they are strongly representative of the docu-
ments they are associated with (as measured by standard
IR weighting schemes). The goal of this phase is to se-
lect discriminative keys that meet the following criteria:
the selected keys can achieve good retrieval quality, they
are likely to appear in user queries, and the key vocabulary
does not contain redundant keys.
Possible key ltering mechanisms for selecting a good
key vocabulary are the following:
Proximity. The proximity lter uses textual context to re-
duce the size of the rare key vocabulary, and retains
keys only composed of terms appearing in the same
textual context, e.g., a phrase, a paragraph, or a docu-
ment windowof a given maximal size. The main argu-
mentation behind this approach is that words appear-
ing close to each other in documents are good candi-
dates to appear together in a query. The analysis pre-
sented in [16] reports the importance of text passages
that are more responsive to particular user needs than
full documents. A similar reasoning is used in a re-
cently proposed method for static index pruning [8]
which indexes signicant sentences, i.e. phrases ap-
pearing in similar contexts.
Redundancy. A key k is semantically adequate for a doc-
ument d if there is a high probability that a user pro-
duces k when searching for d. As the user is more
likely to generate generic terms in a query, a semanti-
cally adequate key contains a specic combination of
quite generic and frequent terms. In other words, the
redundancy lter ensures that all term subsets of a key
are frequent keys. Note that all supersets of rare keys
are redundant, as they only increase the vocabulary
without improving retrieval performance.
Top-k keys. As it is possible to compute relevance be-
tween a document and a key, we can choose for each
document the top-k most relevant keys. The number
of chosen keys can, for example, depend on document
156 Informatica 30 (2006) 153161 I. Podnar et al.
ki1
ki2
kij
...
K
i
D
i
ki1 posting listi1
ki2 posting listi2
kij posting listij
... ...
Step 1
select
vocabulary
Step 2
select
postings
(locally)
Peer
i
...
insert
(key, posting list)
into P2P overlay
Peer
j
Step 3
select postings
(globally)
Peer
l
(klx, posting listlx)
(klz, posting listlz)
Peer
k
(kx, posting listx)
(ky, posting listy)
(kz, posting listz) (kkx, posting listkx)
(kky, posting listky)
(kix, posting listix)
(kiy, posting listiy)
(kiz, posting listiz)
Figure 2: The framework for P2P indexing
length and be controlled by a relevance threshold. The
key vocabulary size can also be limited to a predened
value.
Query-adaptive indexing. Keys are coined by combining
terms appearing in past user queries.
Note that the above mentioned techniques can be com-
bined to reduce the key vocabulary size. Our current pro-
totype uses the proximity and redundancy lter, and com-
putes HDKs by combining terms appearing within a doc-
ument window of maximal size w while ensuring that all
key subsets are globally frequent. We also extend the key
vocabulary with frequent keys to improve retrieval perfor-
mance. The top-k indexing of frequent keys is feasible be-
cause our analysis shows that the size of the frequent key
vocabulary is less than 1% of the HDK vocabulary size.
Computing HDKs. We compute HDKs of increasing
size because a peer needs to know s-level globally frequent
keys to compute (s + 1)-level keys since these are candi-
dates for expansion with additional keys. An indexing peer
rst categorizes terms from V as locally rare and locally
frequent single-term keys, and publishes their local DFs in
the P2P network. In our current implementation the index-
ing peer needs to be informed if keys from its local doc-
ument collection are globally frequent. Locally frequent
keys are necessarily globally frequent, but locally rare keys
may still be globally frequent. As the indexing peer has no
local knowledge about the global key DFs, it relies on the
P2P network which maintains the global statistics and no-
ties the indexing peer if a key considered as locally rare is
globally frequent. With such a notication mechanism in
place, each indexing peer will have a consistent knowledge
about globally frequent keys for its local document collec-
tion. A globally frequent key is expended by another fre-
quent key with a restriction that their terms occur within a
predened window as dened by the proximity lter. The
indexing peer counts the number of documents in which
this property holds to determine local DF of a newly cre-
ated key, and categorizes it as locally rare or frequent. Then
it publishes the key and its DF into the P2P network. The
process continues until s
max
-term keys are created.
3.1.2 Local selection of postings
In this phase, a peer decides which documents are indexed
for its key vocabulary. Since there are capacity constrains
that restrict the number of postings associated with a key,
each peer is allowed to insert only a limited number of post-
ings, e.g. controlled by a globally dened value DF
max
.
Firstly, a peer can insert all postings for locally rare keys,
and the P2P overlay then decides globally about the set of
postings that get stored. Secondly, a peer can be notied
when its already published locally rare keys become glob-
ally frequent, and using this global knowledge provided by
the P2P network, insert postings only for globally rare keys,
which will reduce the indexing trafc compared to the rst
solution. Thirdly, for locally and globally frequent keys,
a peer can publish top-DF
max
representative postings by
ranking documents based on their relevance with respect to
a given key.
Our current prototype implements the second and third
option: Indexing peers publish all local postings for glob-
ally rare keys. To improve retrieval performance and ef-
ciently answer queries composed of frequent terms and
term combinations, we have decided to index also glob-
ally frequent keys and maintain only DF
max
postings in
the global index. Indexing peers therefore also insert top-
DF
max
postings associated with their local keys that are
globally frequent.
3.1.3 Global selection of postings
This phase takes place at the peer responsible for storing
and maintaining the posting list associated with a given key.
The total index size maintained by the peer is has to comply
with its available storage space, which in turn also inu-
ences the maximum number of postings associated with a
key. Furthermore, the posting list size is restricted to meet
capacity constraints imposed by the retrieval phase.
BEYOND TERM INDEXING: A P2P FRAMEWORK FOR. . . Informatica 30 (2006) 153161 157
For globally rare keys peers store all received posting
lists because rare key are by denition associated with short
posting lists. If there are more postings than the constraints
allow, the responsible peer has to select a set of postings
to keep as it is already done locally for frequent keys. The
mechanisms for choosing DF
max
postings can range from
a simple randomselection procedure, to sophisticated rank-
ing of postings using available techniques for content and
link-based ranking. Our prototype currently stores all re-
ceived posting associated with globally rare keys and top-
DF
max
ranked postings associated with frequent keys, i.e.
documents with highest relevance to a frequent keys ac-
cording to TF-IDF scores.
3.2 Retrieval
The retrieval part of our framework deals with the problem
of nding, given a query Q = {t
1
, t
2
, . . . , t
q
}, the corre-
sponding relevant keys in the global P2P index, retrieving
the postings associated with those keys, and ranking of the
result set. It is performed in two steps:
1. mapping of a query to keys and subsequent retrieval
of postings from the P2P overlay, and
2. merging of the received answers with a global ranking
scheme in order to produce a single hit list.
Figure 3 shows the querying process when peer P
i
re-
ceives the query Q. We assume Q is a simple set of terms,
and it must be mapped to a set of keys present in the P2P
global index. All term subsets of a query are possible key
candidates that may be stored in the P2P index, and P
i
needs to explore the lattice of term combinations and check
which key candidates are indeed keys from the global in-
dex.
The optimistic strategy is to start with the largest possi-
ble term set (marked as level-1), limited either by the query
size q (as it is assumed in Figure 3) or the maximal number
of terms forming a key (s
max
). P
i
will try to retrieve the
keys sequentially by exploring initially level-1 keys, and
then depending on the result of the previous step continue
with level-2 keys, . . ., level-q keys. The perfect situation
occurs when k
11
= {t
1
, t
2
, . . . , t
q
} is an HDK, in other
words, a user has posed a good discriminative query for the
indexed document collection: The posting list is readily
available and is simply retrieved from the global index by
issuing a single request search(k
11
) to the P2P network.
Indeed, this will not happen with all user queries. There-
fore, level-2 set of potential key candidates of size q 1
is explored. If the P2P network stores postings associated
with level-2 keys that cover all terms in Q, the hit list is the
union of the retrieved postings. However, it is possible that
either (a) no level-2 keys exist in the index, or (b) that some
t
i
Q are not covered by the retrieved keys. In case of (a),
P
i
explores all level-3 keys and subsequently higher level
keys until nally reaching level-q. In case of (b), P
i
ex-
plores level-3 keys but only those that contain non-covered
terms from the previous levels. Depending on the retrieved
set of keys, the procedure is extended to higher-level keys
until all t
i
Q are covered by the retrieved key set. The re-
sulting answer set is the union of postings associated with
the retrieved keys.
However, in some cases the answer set may be empty
because all key candidates are frequent. A valid option is
to notify a user that the query is non-discriminative with
respect to the document collection, and offer support for
query renement. Another possibility is to apply additional
measures and, for example, index frequent keys or perform
query expansion that maps terms to semantically related
terms which enables the creation of many key-candidates
and increases the probability of nding relevant keys in
the global index. As our prototype currently indexes fre-
quent keys, all single-term keys from V can eventually be
retrieved, which solves the problem of queries that map to
frequent keys.
Let us again consider the example in Figure 3. The P2P
network does not contain k
11
, and P
i
searches for level-2
keys. Out of q possible level-2 keys, only k
21
is an HDK
and its posting list is retrieved from P
l
. However, t
q
is still
not covered, and P
i
unsuccessfully explores keys on higher
levels containing t
q
, until nally reaching level-q where a
posting list is associated with k
qq
.
Both posting lists are merged by a simple union proce-
dure in the second step. Note that there is a possibility that
t
q
appears in some documents associated with k
21
, but as
it does not appear within the predened window, k
11
was
not produced during the indexing procedure. Note also that
documents associated with k
qq
are those with highest rele-
vance to t
q
, but they may also contain other terms from Q,
also outside the window. The ranking function produces
the nal ranked result set giving higher scores to documents
with higher relevance to Q. The peer uses available ranking
methods, either content based, and/or link-based, to pro-
duce the nal ranking of the merged result set. The ranking
procedure must be implemented in a distributed fashion us-
ing global document collection statistics such as global DFs
available from the P2P index.
4 Experimental evaluation
Experiments have been performed using our prototype
P2P search engine which is built on top of the P-Grid
P2P layer [19]. Our implementation is a fully-functional
P2P search engine that integrates a solution for distributed
maintenance of global key vocabulary with associated doc-
ument frequencies, calculates HDKs used for indexing in
a completely distributed fashion, and stores posting lists in
a global index for computed keys. The presented experi-
ments analyze both rare and frequent keys associated with
top-DF
max
postings, and the corresponding index sizes.
Experimental setup. The experiments were carried out
using a subset of news articles from the Reuters corpus
1
.
1
http://about.reuters.com/researchandstandards/corpus/
158 Informatica 30 (2006) 153161 I. Podnar et al.
Peer
i
k
11
= {t
1
, t
2
, t
q
}
k
21
={t
1
, t
2
, t
q-1
}, k
22
, , k
2q
...
kq1= {t1}, kq2= {t2} kqq= {tq}
Step 1
query-to-key
mapping
...
Search sequentially for keys
starting with level 1, then 2,
, with a goal of covering all
terms t1, t2, tq.
query
Q = {t
1
, t
2
, t
q
}
ranked hit
list for Q
Step 2
rank postings
using global
metrics
Peer
k
Peer
l
posting list_k21
hit list
posting list_kqq
(level-1)
(level-2)
...
(level-q)
Figure 3: The framework for P2P retrieval
The documents in our test collection contain between 70
and 3000 words, while the average number of terms in a
document is 170, and the average number of unique terms
is 102. To simulate the evolution of a P2P system, i.e. peers
joining the network and increasing the document collec-
tion, we started the experiment with 4 peers, and added
additional 4 peers at each new experimental run. Each peer
contributes with 5000 documents to the global collection,
and computes HDKs for its local documents. Therefore,
the initial global document collection for 4 peers is 20,000
documents, and it is augmented by the new 20,000 docu-
ments for each experimental run. The maximum number of
peers is 24, hosting in total the global collection of 120,000
documents. The experiments were performed on our cam-
pus intranet. Each peer runs on a Linux RedHat PC with
1GB of main memory connected via 100 Mbit Ethernet.
The prototype system is implemented in Java.
Performance analysis. Experiments investigate index-
ing and retrieval costs in terms of bandwidth and stor-
age consumption, and compare retrieval performance of
our P2P search engine to a centralized engine. All docu-
ments are pre-processed: First we removed 250 common
English stop words and applied the Porter stemmer, and
then we removed 100 extremely frequent terms (e.g. the
term reuters appears in all the news). DF
max
is set to
250 and 500, and s
max
is 3. We have decided to build
keys with a maximum of 3 terms having observed that an
increase of s
max
substantially increases the computational
load without signicant improvement of retrieval perfor-
mance. In particular this is benecial for P2P Web search
engines because short queries are the ones typically used
for Web retrieval. We set the window size to 20 words for
the proximity lter as this is the average length of phrases
in the collection.
Figure 4 shows the growth of the HDK vocabulary per
peer maintained in the global index when increasing the
number of documents by adding new peers. Note that
each peer contributes an equal number of documents into
the global collection. As expected, an increased value of
DF
max
reduces the key vocabulary size because there are
0
50000
100000
150000
200000
250000
300000
350000
400000
20000 40000 60000 80000 100000 120000
#Documents
#
K
e
y
s
DFmax=250 DFmax=500
Figure 4: HDK vocabulary per peer
less frequent single-term keys to build 2 and 3-term rare
keys. Both experimentally obtained result sequences show
logarithmic growth and are expected to converge to a con-
stant value for large document collection sizes. We can
conclude that a peer will maintain a negligibly increasing
number of HDKs when increasing the document collec-
tion size by adding new peers. Therefore, our experiments
show that the size of the total key vocabulary maintained
in the global P2P index will increase almost linearly with
the number of documents for large document collections.
The number of keys is signicantly larger compared to the
single-term vocabulary that grows with

x (Heaps law),
but we expect benets in terms of retrieval efciency and
bandwidth consumption.
Figure 5 compares the average posting list size in case
of HDK and single-term indexing. Surprisingly, the aver-
age posting list size of the HDK index remains constant
and much smaller than DF
max
, although we only limit the
maximum size of posting lists. It is superior to single-term
indexing that exhibits a linear increase of the average post-
ing list size. This nding has a major impact on the band-
width consumption during retrieval as it is shown in Fig-
ure 6 that depicts the average number of retrieved postings
per query and compares single-term to the HDK approach
with DF
max
= 250 and DF
max
= 500.
Since a query log is not available for the Reuters corpus,
BEYOND TERM INDEXING: A P2P FRAMEWORK FOR. . . Informatica 30 (2006) 153161 159
0
20
40
60
80
100
120
20000 40000 60000 80000 100000 120000
#Documents
#
P
o
s
t
i
n
g
s
HDK, DFmax=250 HDK, DFmax=500 ST
Figure 5: Average posting list size
we have preprocessed a Wikipedia query log
2
to generate
1,000 queries. From all available queries we have initially
extracted 1,000,000 queries that have more than 20 hits
from the Reuters corpus. The resulting 1,000 queries were
selected randomly from this set. The generated queries
contain on average 3.14 terms, with a minimum of 2 and
maximum of 8 terms. Note that longer queries are favor-
able for our search engine because the probability of nd-
ing precomputed rare keys in the P2P index increases. Sin-
gle term queries would generate bounded trafc since only
DF
max
postings are retrieved and are therefore not consid-
ered.
Figure 6 shows an enormous reduction of bandwidth
consumption per query of the HDK-based approach com-
pared to the nave single term indexing: 92% for DF
max
=
500 and even 96% for DF
max
= 250. In addition, the
reduction increases when increasing the number of doc-
uments in the global collection which practically demon-
strates the effect of the bounded number of index postings
to bandwidth consumption during retrieval.
0
2000
4000
6000
8000
10000
12000
14000
80000 100000 120000
#Documents
#
P
o
s
t
i
n
g
s
HDK, DFmax=250
HDK, DFmax=500
ST
Figure 6: Number of retrieved postings per query
The essential question remains whether the retrieval per-
formance of the HDK approach is satisfactory and compa-
rable to centralized counterparts. Due to the lack of rele-
vant judgment for the generated query set, we have com-
pared the retrieval performance to a centralized engine
3
with BM25 relevance computation scheme which is cur-
rently considered as one of the top performing relevance
schemes [13].
2
www.wikipedia.org, query log 08/2004 and 09/2004
3
Terrier search engine, http://ir.dcs.gla.ac.uk/terrier/
0
10
20
30
40
50
60
70
80
90
100
80000 100000 120000
#Documents
o
v
e
r
l
a
p
[
%
]
HDK, DFmax=250
HDK, DFmax=500
ST
Figure 7: Overlap on top 20
Figure 7 presents the overlap on top-20 retrieved docu-
ments retrieved by our system and the Terrier search en-
gine. We are interested in the high-end ranking as typ-
ical users are often interested only in the top 20 results.
The comparison shows signicant and satisfactory over-
lap between the retrieved result sets. As expected, the re-
trieval performance is better for larger DF
max
as we are
approaching single-term indexing (when DF
max
equals
vocabulary size, all HDKs are single-term). There is ob-
viously a trade-off between retrieval quality and bandwidth
consumption of our indexing strategy because an increased
value of DF
max
results in an increased bandwidth con-
sumption during retrieval, while on the other hand, offers
retrieval performance that better mimics centralized en-
gines. However, as required bandwidth is the major ob-
stacle for retrieval, it is vital to choose an adequate value
for DF
max
taking into account available network capacity.
There is another limiting factor inuenced by this param-
eter: Note that the process of building the key vocabulary
is less expensive in terms of computational costs and band-
width consumption for an increased DF
max
. This is intu-
itively clear because for small values of DF
max
, the num-
ber of potential terms for building rare keys increases, and
the whole process creates more precomputed keys, which
is obviously very efcient during the retrieval phase if such
term sets, i.e. keys, appear in user queries.
0
1000000
2000000
3000000
4000000
5000000
6000000
20000 40000 60000 80000 100000 120000
#Documents
#
P
o
s
t
i
n
g
s
DFmax=250 DFmax=500 ST
Figure 8: Number of postings per peer (index size)
To quantify indexing costs and the inuence of DF
max
on required storage and bandwidth consumption, Figure 8
shows the average number of postings stored per peer for
DF
max
= 250, DF
max
= 500, and the single-term index.
160 Informatica 30 (2006) 153161 I. Podnar et al.
This is the actual index size per peer. The curve is logarith-
mic for HDKs as it is mostly inuenced by the key vocab-
ulary size, and is linear for the single-term index (the de-
crease of the term vocabulary size per peer compensates for
the increased posting list size). It is visible that a peer hosts
signicantly more postings for the HDK approach, and the
index size increases when decreasing DF
max
. The index-
ing costs are still feasible and protable because the gains
in terms of bandwidth consumption during the query phase
are huge, and compensate for the increased indexing costs.
Nevertheless, we plan to investigate further techniques to
reduce the index size, and study the inuence of DF
max
on bandwidth consumption during indexing and retrieval,
and on the resulting retrieval performance.
5 Related Work
To the best of our knowledge, [6] is the only published P2P
framework for information retrieval that is based on dis-
tributed semantic indexing on top of structured P2P net-
works, and therefore highly related to our work. However,
it uses a different indexing approach which relies on se-
mantic locality to cluster documents with similar seman-
tics, and to store them on nearby peers.
A number of solutions have been proposed to cope with
the scalability problem of P2P information retrieval. Re-
cent approaches combine global knowledge with local in-
dexes: For example, PlanetP [7] gossips compressed in-
formation about peers collections in an unstructured P2P
network, while MINERVA [3] maintains a global index
with peer collection statistics in a structured P2P overlay
to facilitate the peer selection process, and implements a
method which penalizes peers holding overlapping docu-
ment collections. A similar approach for resource selection
is presented in [10, 11] in the context of hierarchical P2P
networks where special directory nodes route queries to ap-
propriate peers having high chances of answering a query.
A recent solution builds an index dynamically following
user queries [2]. A super-peer backbone network main-
tains the information about good candidates for answering
a query, while peers answer the queries based on their local
document collections. Since large posting lists are the ma-
jor concern for global single-term indexing, both [15] and
[18] have proposed top-k posting list joins, Bloom lters,
and caching as promising techniques to reduce search costs
for multi-term queries.
The listed solutions are orthogonal to our approach since
they use different assumptions to reduce network trafc.
However, our approach is not the only indexing strategy
that uses term sets as indexing features. The set-based
model [14] indexes term sets occurring in queries, and ex-
ploits term correlations to reduce the number of indexed
term sets. The authors report signicant gains in terms
of retrieval precision and average query processing time,
while the increased index processing time is acceptable. In
contrast to our indexing scheme, the set-based model has
been used to index frequent term sets, and has been de-
signed for a centralized setting.
6 Conclusion
We have presented a P2P framework for information re-
trieval that uses a novel retrieval model based on global
indexing of rare keys in structured P2P overlay networks.
Rare keys are terms and term sets occurring in a limited
number of documents, limiting thus the size of posting
lists stored in the global index. This approach directly ad-
dress the unscalable network trafc cased by large post-
ing lists for single-term indexing. The experimental results
have shown the potential of our approach in preserving a
retrieval quality (top-k precision) comparable to the stan-
dard single term with BM25 scoring scheme with enor-
mous reduction of generated trafc during retrieval. Next,
it gives evidence that the indexing process is feasible be-
cause it produces the key vocabulary and global index of
manageable size. More importantly, the average posting
list size remains constant when increasing the global docu-
ment collection size. Therefore, the generated trafc during
retrieval remains bounded, which solves one of the major
obstacles for scalable IR in P2P networks.
Acknowledgements
The work presented in this paper was carried out in the
framework of the EPFL Center for Global Computing and
supported by the Swiss National Funding Agency OFES as
part of the European FP 6 STREP project ALVIS (002068).
References
[1] K. Aberer, L. O. Alima, A. Ghodsi, S. Girdzijauskas,
S. Haridi, and M. Hauswirth. The Essence of P2P: A Refer-
ence Architecture for Overlay Networks. In Fifth IEEE In-
ternational Conference on Peer-to-Peer Computing, pages
1120, 2005.
[2] W.-T. Balke, W. Nejdl, W. Siberski, and U. Thaden. Dl
meets p2p - distributed document retrieval based on clas-
sication and content. In 9th European Conference on
Research and Advanced Technology for Digital Libraries,
(ECDL), pages 379390, 2005.
[3] M. Bender, S. Michel, P. Triantallou, G. Weikum, and
C. Zimmer. Improving collection selection with overlap
awareness in P2P search engines. In Proceedings of the
28th annual international ACM SIGIR conference on Re-
search and development in information retrieval (SIGIR
05), pages 6774, 2005.
[4] W. Buntine, K. Aberer, I. Podnar, and M. Rajman. Opportu-
nities from open source search. In Proceedings of the 2005
IEEE/WIC/ACM International Conference on Web Intelli-
gence, pages 28, 2005.
[5] F. Cacheda, V. Plachouras, and I. Ounis. A case study of
distributed information retrieval architectures to index one
terabyte of text. Inf. Process. Manage., 41(5):11411161,
2005.
BEYOND TERM INDEXING: A P2P FRAMEWORK FOR. . . Informatica 30 (2006) 153161 161
[6] Y. Chen, Z. Xu, and C. Zhai. A scalable semantic indexing
framework for peer-to-peer information retrieval. In SIGIR
2005 workshop: Heterogeneous and Distributed Informa-
tion Retrieval, 2005.
[7] F. M. Cuenca-Acuna, C. Peery, R. P. Martin, and T. D.
Nguyen. PlanetP: Using Gossiping to Build Content Ad-
dressable Peer-to-Peer Information Sharing Communities.
In 12th IEEE International Symposium on High Perfor-
mance Distributed Computing (HPDC-12), June 2003.
[8] E. S. de Moura, C. F. dos Santos, D. R. Fernandes, A. S.
Silva, P. Calado, and M. A. Nascimento. Improving Web
search efciency via a locality based static pruning method.
In WWW 05: Proceedings of the 14th International Con-
ference on World Wide Web, pages 235244, 2005.
[9] J. Li, B. Loo, J. Hellerstein, F. Kaashoek, D. Karger, and
R. Morris. The feasibility of peer-to-peer web indexing and
search. In Peer-to-Peer Systems II: 2nd International Work-
shop on Peer-to-Peer Systems (IPTPS), pages 207

U215,
2003.
[10] J. Lu and J. Callan. Content-based retrieval in hybrid peer-
to-peer networks. In Proceedings of the 12th International
Conference on Information and Knowledge Management
(CIKM 03), pages 199206, 2003.
[11] J. Lu and J. Callan. Federated search of text-based digital
libraries in hierarchical peer-to-peer networks. In Advances
in Information Retrieval, 27th European Conference on IR
Research (ECIR), pages 5266, 2005.
[12] Q. Lv, P. Cao, E. Cohen, K. Li, and S. Shenker. Search and
replication in unstructured peer-to-peer networks. In 16th
International Conference on Supercomputing, June 2002.
[13] V. Plachouras, B. He, and I. Ounis. University of Glasgow
at TREC2004: Experiments in Web, Robust and Terabyte
tracks with Terrier. In Proceeddings of the 13th Text RE-
trieval Conference (TREC 2004), 2004.
[14] B. Pssas, N. Ziviani, J. Wagner Meira, and B. Ribeiro-
Neto. Set-based vector model: An efcient approach
for correlation-based ranking. ACM Trans. Inf. Syst.,
23(4):397429, 2005.
[15] P. Reynolds and A. Vahdat. Efcient Peer-to-Peer Keyword
Searching. Middleware03, 2003.
[16] G. Salton, J. Allan, and C. Buckley. Approaches to Passage
Retrieval in Full Text Information Systems. In Proceedings
of the 16th Annual International ACMSIGIR Conference on
Research and Development in Information Retrieval, pages
4958, 1993.
[17] G. Salton and C. Yang. On the specication of term val-
ues in automatic indexing. Journal of Documentation,
4(29):351372, 1973.
[18] T. Suel, C. Mathur, J.-W. Wu, J. Zhang, A. Delis, M. Khar-
razi, X. Long, and K. Shanmugasundaram. ODISSEA:
A Peer-to-Peer Architecture for Scalable Web Search and
Information Retrieval. Proc. Intl Workshop Web and
Databases (WebDB 03), pages 67

U73, 2003.
[19] The P-Grid Consortium. The P-Grid project, 2005.
http://www.p-grid.org/.
162 Informatica 30 (2006) 153161 I. Podnar et al.

You might also like