UPR: Usage-based Page Ranking for Web Personalization
Magdalini Eirinaki
Michalis Vazirgiannis
Athens University of Economics and Business
Dept. of Informatics
Athens, Greece
Athens University of Economics and Business
Dept. of Informatics
Athens, Greece
[email protected]
[email protected]
ABSTRACT
Recommendation algorithms aim at proposing “next” pages to a
user based on her navigational behavior. In the vast majority of
related algorithms, only the usage data are used to produce
recommendations. We claim that also taking into account the web
structure, using link analysis algorithms, improves the quality of
recommendations. In this paper we present UPR, a
personalization algorithm which combines usage data and link
analysis techniques for ranking and recommending web pages to
the end user. Using the web site’s structure and previously
recorded user sessions we produce personalized navigational subgraphs (prNGs) to be used for applying UPR. Experimental
results show that the accuracy of the generated recommendations
is superior to pure usage-based approaches.
Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications – Data
Mining; H.3.5 [Information Storage and Retrieval]: Online
Information Services - Web-based services
General Terms
Algorithms, Experimentation.
Keywords
Web Personalization, Link Analysis, PageRank, Usage-based
PageRank
1. INTRODUCTION
The evolution of the World Wide Web into the main information source
for millions of people has imposed the need for new methods and
algorithms that can efficiently process the vast amounts of data
residing on it. Users become more and
more demanding in terms of the quality of information provided
to them when searching the web or browsing a web site. The area
of web mining, including any method that utilizes data residing on
the web, namely usage, content and structure data, addresses this
need. The most common applications involve the ranking of the
results of a web search engine and the provision of
recommendations to users of – usually commercial – web sites,
known as web personalization.
PageRank is the most popular link analysis algorithm, used in
order to rank the results returned by a search engine after a user
query. The ranking is performed by evaluating the importance of
a page in terms of its connectivity to and from other important
pages. Many variations of this algorithm have been proposed in the
past, aiming at refining the acquired results. Some of these
approaches make use of the so-called "personalization vector" of
PageRank in order to bias the results towards the
individual needs of every user searching the web.
In this work, we introduce PageRank in a totally different context,
that of web personalization. Web personalization is defined as any
action that adapts the information or services provided by a Web
site to the needs of a user or a set of users, taking advantage of the
knowledge gained from the users’ navigational behavior and
individual interests, in combination with the content and the
structure of the Web site [10]. In the past, many approaches have
been proposed, based on pure web usage mining algorithms,
markov models, or a combination of usage and content mining
techniques.
Motivated by the fact that in the context of navigating a web site,
a page/path is important if many users have visited it before, we
propose a novel approach that is based on a personalized version
of PageRank, applied to the navigational tree created by the
previous users’ navigations. We personalize PageRank by biasing
it to “favor” pages and paths previously preferred by many users.
We prove that this hybrid algorithm can be applied to any web
site’s navigational graph as long as it satisfies certain properties.
Thus, it is orthogonal to any graph synopsis we may choose to
model the user sessions with, such as a Markov Chain, higher-order
Markov models, tree-like synopses, etc. This approach is therefore
generic; it is proven to converge after a few iterations and thus
provides fast results, while we can trade off simplicity against
accuracy by applying it to less or more complex
web navigational graph models. More specifically, our key
contributions are:
• UPR, a novel usage-based personalized PageRank-style
algorithm used for ranking the web pages of a site based
on previous users’ navigational behavior.
• A method for creating personalized navigational graph
synopses (prNG) to be used for applying UPR.
• A personalization method which combines usage and
structure data for ranking and recommending web pages to
the end user.
• A set of experimental results which prove that the
incorporation of link analysis in the web personalization
process improves the recommendations’ accuracy.
A modified version of this paper has appeared in the Proceedings of the 5th IEEE International Conference on Data Mining (ICDM ’05)
The rest of this paper is organized as follows: In Section 2 we
overview some related work in the areas of PageRank-based
algorithms, web usage mining and personalization. In Section 3
we overview the basic properties and functionality of the
PageRank algorithm, whereas in Section 4 we present our novel
algorithm, UPR and prove its convergence. Section 5 describes
the method of creating the Navigational Tree NtG and the localized
(personalized) Navigational Graph synopses prNG in order to
provide personalized recommendations by applying UPR. Section
6 includes some experimental evaluation of the proposed
approach, whereas in Section 7 we conclude with insights to our
plans for future work.
2. RELATED WORK
Although the connectivity features of the web graph have been
extensively used for personalizing web search results [1, 11, 23],
only a few approaches exist that take them into consideration in
the web site personalization process. Zhu et al. [26] use citation
and coupling network analysis techniques in order to conceptually
cluster the pages of a web site. The proposed recommendation
system is based on Markov models. Nakagawa and Mobasher [20]
use the degree of connectivity between the pages of a web site as
the determinant factor for switching among recommendation
models based on either frequent itemset mining or sequential
pattern discovery. Nevertheless, none of the aforementioned
approaches fully integrates link analysis techniques in the web
personalization process by exploiting the notion of the authority
or importance of a web page in the web graph.
In a very recent work, Huang et al. [12] address the data sparsity
problem of collaborative filtering systems by creating a bipartite
graph and calculating linkage measures between unconnected pairs
in order to select candidates and make recommendations. In this
study the graph nodes represent both users and rated/purchased
items. Finally, subsequent to our work, Borges and Levene [BL06]
independently proposed two link analysis ranking methods, SiteRank
and PopularityRank, which are the PageRank
algorithm and a personalized variation of it applied to a web site
graph. This work focuses on the comparison of the distributions
and the rankings of the two methods rather than proposing a web
personalization algorithm. The authors’ concluding remarks, that
the topology of the web site is very important and should be taken
into consideration in the web personalization process, further
support our claim. To the best of our knowledge, however, this is
the first approach that uses a personalized, usage-biased
PageRank-like algorithm in the context of web personalization.
In our approach, we propose the use of web navigational graph
synopses, such as Markov models. In the past, many researchers
have proposed the use of 1st order (Markov Chains) [2, 5, 24],
higher-order [16], or hybrid [4, 7, 17, 25] Markov models. Even
though Markov Chains provide a simple way to capture sequential
dependence, they do not take into consideration the “long-term
memory” aspects of web surfing behavior. Higher-order Markov
models are more accurate for predicting navigational paths; there
exists, however, a trade-off between improved coverage and an
exponential increase in state-space complexity as the order
increases. What is more, such complex models often require
inordinate amounts of training data. Moreover, the hybrid models,
that combine different order Markov models, require much more
resources in terms of preprocessing and training. We should stress
at this point that our approach is orthogonal to the type of
synopsis one may choose. This synopsis may be a Markov model
of any order (depending on the simplicity and accuracy required),
or any other graph synopsis, such as those proposed in [21, 22].
Apart from Markov models, there also exist many approaches that
perform web usage mining for web personalization, based on
association rules mining, clustering, sequential pattern discovery,
frequent pattern discovery or collaborative filtering. Since this is
out of the scope of this paper, we refer the reader to [8, 9] for an
extensive overview of such approaches.
3. PRELIMINARIES & REVIEW OF
PAGERANK
The PageRank algorithm [3] is the most popular link analysis
algorithm, broadly used for assigning numerical weights to web
documents and used by web search engines in order to rank the
retrieved results. The algorithm models the behavior of a
random surfer, who either chooses an outgoing link from the page
he’s currently at, or “jumps” to a random page after a few clicks.
The PageRank of a page is defined as the probability that the
random surfer is at this page at some particular time step k > K.
This probability is correlated with the importance of this
page, as it is defined based on the number and the importance of
the pages pointing to it. For sufficiently large K this probability is
unique, as illustrated in what follows.
Consider the web as a directed graph G, where the N nodes
represent the web pages and the edges the links between them.
The random walk on G induces a Markov Chain where the states
are given by the nodes in G, and M is the stochastic transition
matrix with mij describing the one-step transition from page j to
page i. The adjacency function mij is 0 if there is no direct link
from pj to pi, and normalized such that, for each j:
∑_{i=1}^{N} m_{ij} = 1    (1)
As stated by the Perron-Frobenius theorem, if M is irreducible
(i.e. G is strongly connected) and aperiodic, then M^k (i.e. the
transition matrix for the k-step transition) converges to a matrix in
which each column is the unique stationary distribution PR*,
independent of the initial distribution PR. The stationary
distribution is the vector which satisfies the equation:

PR* = M × PR*    (2)

in other words, PR* is the dominant eigenvector of the matrix M.
Since M is the stochastic transition matrix over the web graph G,
PageRank is in essence the stationary probability distribution over
pages induced by a random walk on the web. As already implied,
the convergence of PageRank is guaranteed only if M is
irreducible and aperiodic [18]. The aperiodicity constraint is
guaranteed in practice in the web context, whereas the
irreducibility is satisfied by adding a damping factor (1-ε) to the
rank propagation (ε is a very small number, usually set to 0.15), in
order to limit the effect of rank sinks and guarantee convergence
to a unique vector. We therefore define a new matrix M’ by
adding low-probability transition edges between all nodes in G
(we should point out that in order to ensure that M’ is irreducible,
we should remove any dangling nodes, i.e. nodes with outdegree
0):
M' = (1 − ε) M + ε u    (3)
In other words, the user may choose a random destination based
on the probability distribution of u. This process is also known as
teleportation. PageRank can then be expressed as the unique
solution to Equation 2, if we substitute M with M’:
PR = (1 − ε) M × PR + ε p    (4)

where p is a non-negative N-vector whose elements sum to 1. Usually
m_{ij} = 1/|Out(p_j)| and u = [1/N]_{N×N}, i.e. the probability of
randomly jumping to another page is uniform. In that case,
p = [1/N]_{N×1}. By choosing, however, u, and consequently p, to
follow a non-uniform distribution, we essentially bias the resultant
PageRank vector computation to favor certain pages.
Thus, u is usually referred to as the personalization vector. This
approach is broadly used in the context of web search engines,
where the ranking of the retrieved results is biased by favoring
pages relevant to the query terms, or to the user's preferences for
certain topic categories [1, 11, 23]. In what follows, we present a
usage-biased version of the PageRank algorithm, used for ranking the
pages of a web site based on the navigational behavior of previous
visitors.

4. USAGE-BASED PAGERANK
So far, PageRank has been used in the context of web search, either
in its original form (assuming the user follows one of the outgoing
links with equal probability, or uniformly jumps to a random page)
or in its personalized form (changing the personalization vector to
favor certain pages). In our approach, we use this algorithm in a
totally different context, that of web personalization. We introduce
a hybrid PageRank-style algorithm for ranking the pages of a web
site in order to provide recommendations to each visitor. For this
reason, we bias the computation using the knowledge acquired from
previous users' visits, as discovered from the user sessions
recorded in the web site's logs. To achieve this, we define both the
transition matrix M and the personalization vector p in such a way
that pages and paths previously visited by other users are preferred.

We consider the directed navigational graph NG, where the nodes
represent the web pages of the web site and the edges represent the
consecutive one-step paths followed by previous users. Both nodes
and edges carry weights. The weight w_i on each node represents the
number of times page p_i was visited, and the weight w_{j→i} on each
edge represents the number of times p_i was visited right after p_j.

Following the aforementioned properties of Markov theory and the
PageRank computation, the usage-based PageRank vector UPR is the
solution to the following equation:

UPR = (1 − ε) M × UPR + ε p    (5)

The transition matrix M on NG is defined as the square N×N matrix
whose elements m_{ij} equal 0 if there does not exist a link (i.e.
visit) from page p_j to p_i, and

m_{ij} = w_{j→i} / ∑_{p_k ∈ Out(p_j)} w_{j→k}    (6)

otherwise. The personalization vector p is defined as

p = [ w_i / ∑_{p_j ∈ WS} w_j ]_{N×1}    (7)

Using the aforementioned formulas, we bias the PageRank calculation
to rank higher those pages that were visited more often by previous
users. We then use this hybrid ranking, combining the structure and
usage data of the site, to provide a ranked recommendation set to
current users.

Note that Equation 1 holds, that is, M is normalized such that each
of its columns sums to 1; therefore M is a stochastic transition
matrix, as required for the convergence condition of the algorithm
to hold. M is, as already mentioned, aperiodic in the web context
and irreducible, since we have included the damping factor (1-ε).
We also eliminate any dangling pages by adding, for all nodes with
no outlinks, links to all other pages with uniform probabilities. It
is therefore guaranteed that Equation 5 will converge to a unique
vector UPR*.

Definition (UPR): We define the usage-based PageRank (UPR) of a web
page p_i as the n-th iteration of the following recursive formula:
UPR_n(p_i) = (1 − ε) ∑_{p_j ∈ In(p_i)} ( UPR_{n−1}(p_j) × w_{j→i} / ∑_{p_k ∈ Out(p_j)} w_{j→k} ) + ε × w_i / ∑_{p_j ∈ WS} w_j    (8)
Each iteration of UPR has complexity O(N²), where N is the number of
pages. The total complexity is thus determined by the number of
iterations, which in turn depends on the size of the dataset. In
practice, however, PageRank (and accordingly UPR) gives good
approximations after about 50 iterations for a damping factor
1 − ε = 0.85 (i.e., ε = 0.15, the value recommended in [3]). The
computations can be accelerated by applying techniques such as those
described in [14, 15], although this is not necessary in the
proposed framework, since UPR is applied to a single web site and
therefore converges after a few iterations.
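As an illustration (not part of the paper's own material), the following minimal Python sketch computes UPR by power iteration over a toy navigational graph given as usage-weighted edge counts, following Equations 5-8. The names (upr, edge_counts, eps) are ours, and the node weights w_i are approximated here by the total incoming visit counts.

from collections import defaultdict

def upr(edge_counts, eps=0.15, iters=100):
    # edge_counts: dict mapping (p_j, p_i) -> number of times p_i was visited right after p_j.
    pages = sorted({p for edge in edge_counts for p in edge})
    w = defaultdict(float)        # node weights w_i (approximated by incoming visit counts)
    out_sum = defaultdict(float)  # sum_k w_{j->k} for every source page p_j
    for (pj, pi), c in edge_counts.items():
        w[pi] += c
        out_sum[pj] += c
    total_w = sum(w[p] for p in pages) or 1.0
    pers = {p: w[p] / total_w for p in pages}        # personalization vector (Eq. 7)
    rank = {p: 1.0 / len(pages) for p in pages}      # uniform initial distribution
    for _ in range(iters):
        new_rank = {p: eps * pers[p] for p in pages}             # teleportation term
        dangling = sum(rank[p] for p in pages if out_sum[p] == 0)
        for p in pages:                                          # dangling pages spread uniformly
            new_rank[p] += (1 - eps) * dangling / len(pages)
        for (pj, pi), c in edge_counts.items():                  # usage-weighted transitions (Eq. 6)
            new_rank[pi] += (1 - eps) * rank[pj] * c / out_sum[pj]
        rank = new_rank
    return rank

# One-step transition counts derived from the sessions of Table 1 (first-order synopsis).
counts = {('a', 'b'): 2, ('b', 'c'): 3, ('c', 'd'): 2, ('b', 'e'): 1, ('e', 'd'): 1,
          ('a', 'c'): 1, ('d', 'f'): 1, ('c', 'b'): 1, ('b', 'g'): 1, ('c', 'f'): 1, ('f', 'a'): 1}
print(sorted(upr(counts).items(), key=lambda kv: -kv[1]))

In this sketch ε plays the role of the teleportation probability (ε = 0.15, damping factor 0.85), matching the convention of Equations 3-5.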
5. UPR RECOMMENDATIONS
As already presented, UPR is a PageRank-style algorithm, based
on the usage data collected from previous users' sessions. UPR is
applied to NG, a directed, weighted, strongly connected graph,
which represents the navigational graph or a synopsis of it. In this
section, we present the process for creating the navigational tree
NtG from the web sessions, and how we can subsequently
create NG synopses from this tree. These synopses (which can be,
for example, Markov models of any order) are in turn used to
construct the personalized sub-graph prNG. Moreover, we present
l-UPR, a localized version of UPR, which is applied to this
personalized subset of the navigational graph.
5.1 Navigational Graph Construction
5.1.1 Navigational Tree (NtG)
The basis for the creation of prNG is a tree-like weighted graph,
further referred to as Navigational Tree (NtG), which represents
all the distinct paths followed by users and recorded in the web
logs. This structure has the special node R as its root, and all the
other nodes are instances of the web pages of WS. The weighted
paths from the root towards the leaves represent all user sessions’
paths included in the web logs. All tree branches terminate in a
special leaf-node E denoting the end of a path. Note that in this
representation, there is replication of states in different parts of
the tree-like structure, and the structure fully describes the
information residing on web logs.
We assume that the log files have been preprocessed and
separated into distinct user sessions US, of length L. The
algorithm for creating the NtG is depicted in Figure 1. Briefly, for
every user session in the web logs, we create a path starting from
the root of the tree. If a subsequence of the session already exists
we update the weights of the respective edges, otherwise we
create a new branch, starting from the last common page visited
in the path. We assume that any consecutive repetitions of the same
page have been removed from the user sessions; on the other hand, we
keep any pages that have been visited more than once, but not
consecutively. We also denote the end of a session using a special
exit node.
Procedure CreateTree(U)
Input: User Sessions U
Output: Navigational Tree *NtG
root <- NtG;
tmpP <- root;
for every US ∈ U do
  while US ≠ ∅ do
    si = first_state(US);
    if parent(tmpP, si) then
      w(tmpP, si) = w(tmpP, si) + 1;
    else
      addchild(tmpP, si);
      w(tmpP, si) = 1;
    endif
    tmpP <- si;
    US <- remove(US, si);
  done
  if parent(tmpP, E) then
    w(tmpP, E) = w(tmpP, E) + 1;
  else
    addchild(tmpP, E);
    w(tmpP, E) = 1;
  endif
  tmpP <- root;
done
Figure 1. NtG Creation Algorithm
In order to make this process clearer, we present a simple
example. Assume that the user sessions of a web site are those
included in Table 1. The navigational tree NtG created after
applying the aforementioned algorithm is depicted in Figure 2.
NtG is used for creating any desired navigational graph synopsis
NG, which is in turn used for producing localized prNG, as we
will describe in the paragraphs that follow.
Table 1. User Sessions
User Session #    Path
1                 a → b → c → d
2                 a → b → e → d
3                 a → c → d → f
4                 b → c → b → g
5                 b → c → f → a
Figure 2. Tree-like Navigational Graph NtG
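As an illustration of the tree-construction step of Figure 1, the following Python fragment is a minimal sketch of our own (the names build_ntg and the nested-dict node representation are assumptions, not the paper's code), applied to the sessions of Table 1.

def build_ntg(sessions):
    # Each node is a dict: {'page': label, 'children': {label: node}, 'weights': {label: edge count}}.
    # 'E' is the special exit node marking the end of a session.
    root = {'page': 'R', 'children': {}, 'weights': {}}
    for session in sessions:
        node = root
        for page in list(session) + ['E']:           # append the special exit node E
            if page in node['children']:              # existing branch: increment the edge weight
                node['weights'][page] += 1
            else:                                     # new branch starting at the last common page
                node['children'][page] = {'page': page, 'children': {}, 'weights': {}}
                node['weights'][page] = 1
            node = node['children'][page]
    return root

# Sessions of Table 1
sessions = [list('abcd'), list('abed'), list('acdf'), list('bcbg'), list('bcfa')]
ntg = build_ntg(sessions)
print(ntg['weights'])   # edge weights out of the root: {'a': 3, 'b': 2}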
5.1.2 NG synopses
The tree structure created using the aforementioned algorithm has
a root node and weighted edges connecting the rest of the tree nodes.
Each web page in WS will have multiple occurrences as a node in
the tree, due to the fact that a page can be visited following
different paths. Our approach's objective is to apply a localized
version of UPR to a fraction of NtG, based on the visits of each
incoming user. Performing this on NtG directly, however, would be
computationally very expensive and would require more time
than is allowed for an online process. Therefore, we create a
synopsis of NtG, which is subsequently used for extracting the
localized prNG. The more detailed the synopsis is, the more
accurate will be the representation of NtG. On the other hand, the
construction of a less detailed synopsis will save time and
computational power.
The simplest synopsis of NtG is a Markov Chain. Then, the edge
weight w_{i→j} is computed as the sum of the weights of all directed
edges between nodes p_i and p_j, and the node weight w_i is computed
as the sum of the weights of all edges pointing to the multiple
occurrences of p_i:

w_i = ∑_{p_k ∈ In(p_i)} w_{k→i}    (9)
This representation is simple to construct; it depends, however, on
the assumption that the navigation is "memoryless", in other words
that the next page to be visited by a user depends only on the page
he is currently at. NG synopses that take into consideration the
"long-term memory" aspects of web surfing are higher-order Markov
models, which can easily be constructed from NtG by computing the
k-step path frequencies (where k is the order of the model). In
Figure 3 we present the NG synopsis of the NtG of Figure 2 if we
choose to model the synopsis as a Markov Chain. The numbers in
parentheses in the nodes denote the number of times each page was
visited, whereas the edge weights denote the number of times the
respective path was visited. Nodes S and E represent the start and
end points respectively. These special nodes are not used when
applying UPR on the graph.
Figure 3. NG synopsis (Markov Chain)
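As an illustration of this aggregation step (Equation 9), the following Python sketch collapses the user sessions of Table 1 directly into first-order edge and node weights; the function name markov_chain_synopsis is ours, and the special start/end nodes S and E are handled as described above.

from collections import Counter

def markov_chain_synopsis(sessions):
    # First-order NG synopsis: edge weights w_{i->j} are one-step transition counts,
    # node weights w_i are the total number of visits to each page (Eq. 9).
    edge_w, node_w = Counter(), Counter()
    for s in sessions:
        path = ['S'] + list(s) + ['E']
        for prev, cur in zip(path, path[1:]):
            edge_w[(prev, cur)] += 1
            if cur != 'E':
                node_w[cur] += 1
    return edge_w, node_w

edges, nodes = markov_chain_synopsis([list('abcd'), list('abed'), list('acdf'),
                                      list('bcbg'), list('bcfa')])
print(nodes['c'], edges[('b', 'c')])   # page c visited 4 times; transition b->c observed 3 times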
5.2 Localized UPR (l-UPR)
In the previous section we presented the UPR algorithm, which can be
applied to a web site in order to rank its web pages based both on
its link structure and on the paths followed by previous users. This
process provides us with a "global" ranking of the web site's pages.
In the context of web site personalization, however, we want to
"bias" this algorithm further, focusing on the path the current
visitor has followed, i.e. generating a "localized", personalized
ranking. Using the NG synopsis we have modeled the user sessions
with, we select a small subset of it based on the current user's
path. This sub-graph includes all the subsequent (to the current
visit) pages visited by previous users up to a predefined path depth
d. Therefore, it includes all the potential "next" pages of the
current user's visit. l-UPR (localized UPR) is in essence the
application of UPR to this small, personalized fraction of the
navigational graph. The resulting ranking is used in order to
provide recommendations to the current visitor. This approach is
much faster than applying UPR to the NG synopsis, since the size of
the graph is dramatically reduced, therefore enabling online
computation. Moreover, the ranking results are personalized for each
individual user, since they are based on the current user's visit
and on the behavior of similar users in the past. We present the
process of creating the personalized sub-graph prNG and the
recommendation process in more detail below.

5.2.1 The Personalized Navigational Graph (prNG)
In short, the process of constructing the personalized sub-graph is
as follows: We expand (part of) the path already visited by the
user, including all the outgoing links (i.e. the pages and the
respective weighted edges) existing in the NG synopsis. The length
of the path taken into consideration when expanding the graph
depends on the NG synopsis we have used (in the case of Markov model
synopses this represents the desired "memory" of the system). We
subsequently perform this operation for the new pages (or paths),
until we reach a predefined expansion depth. We then remove any
pages that have already been visited by the user, since these do not
need to be included in the generated recommendations. Their
subsequent paths, however, are linked to the rest of the sub-graph
(consider this operation as removing a node from a linked list).
This ensures that all the pages previously visited by users having
similar behavior will be kept in the final sub-graph, without
including any higher-level pages they might have used as hubs for
their navigation. After reaching the final set of nodes, we
normalize each node's outgoing edge weights.

Before proceeding with the technical details of this algorithm, we
illustrate its functionality using two examples, based on the
sessions included in Table 1. In both examples we create the prNGs
for two user visits including the paths {a → b} and {b → c}. In the
first example, we assume that the sessions are modeled using a
Markov Chain NG synopsis. Using the path frequencies for the
one-step transitions, we expand the two paths, {a → b} and {b → c},
to create the respective prNGs, as shown in Figure 4. The second
example is based on a 2nd-order Markov model NG synopsis. Note that
in this case we also use the path frequencies of two-step
transitions. The corresponding prNGs for the two paths are
illustrated in Figure 5. The outgoing edge weights of each node are
normalized so that they sum to 1. We also observe that the nodes
included in each prNG depend on the NG synopsis we choose to model
the user sessions with.
Figure 4. prNG of Markov Chain NG synopsis
Figure 5. prNG of 2nd order Markov model NG synopsis
The prNG construction algorithm is presented in Figures 6 and 7.
The algorithm complexity depends on the synopsis used, since the
choice of the synopsis affects the time needed for locating the
successive pages for expanding the current path. It also depends
on the number of outgoing links of each sub-graph’s page and the
expansion depth d. Therefore, if the complexity of locating
successive pages in a synopsis is k, the complexity of the prNG
creation algorithm is O(k · fanout(NG)^d), where fanout(NG) is the
maximum number of outgoing links of a node in NG. In the case of
Markov model synopses, k = 1, since locating the outgoing pages of a
page or path reduces to a lookup in a hash table.
Procedure Create_prNG(CV, NG)
Input: Current User Visit CV, Navigational Graph NG
Output: Subset of NG, prNG
start
  CV = {vp};
  cp = lastVisitedPath(CV);
  expand(cp, NG, depth, expNG);
  removeVisited(expNG, CV);
  updateEdges(expNG);
  prNG = normalize(expNG);
end
Figure 6. prNG Generation Algorithm
Procedure expand(cp, NG, d, eNG)
Input: last page/path visited cp, navigational graph synopsis NG, depth of expansion d
Output: expanded navigational graph eNG
start
  P := cp;
  R := rootNode(eNG);
  tempd = 0;
  addNode(eNG, R, cp);
  while (tempd <= d) do
    for every (p ∈ P of the same level) do
      for every np = linksto(NG, p, np, w) do
        addNode(eNG, p, np, w);
        P += np;
      done;
    done;
    tempd += 1;
  done;
end
Figure 7. Path expansion sub-routine
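For concreteness, the following Python fragment is a simplified sketch of our own (not the paper's code) of the prNG construction for a first-order synopsis. It follows the expand / removeVisited / normalize steps of Figures 6 and 7, but, unlike the full algorithm, it simply drops edges leading back to visited pages instead of splicing their successors; the names build_prng and edge_w are assumptions.

from collections import Counter

def build_prng(current_path, edge_w, depth=2):
    # edge_w: dict (p_j, p_i) -> number of observed visits of p_i right after p_j.
    visited = set(current_path)
    frontier = {current_path[-1]}
    sub_edges = {}
    for _ in range(depth):                     # expand up to the given depth
        nxt = set()
        for (pj, pi), w in edge_w.items():
            if pj in frontier and pi != 'E':
                sub_edges[(pj, pi)] = w
                nxt.add(pi)
        frontier = nxt
    # Drop edges leading back to already visited pages (they are not candidate recommendations).
    sub_edges = {(pj, pi): w for (pj, pi), w in sub_edges.items() if pi not in visited}
    # Normalize each node's outgoing edge weights so that they sum to 1 (cf. Equation 1).
    out_sum = Counter()
    for (pj, pi), w in sub_edges.items():
        out_sum[pj] += w
    return {(pj, pi): w / out_sum[pj] for (pj, pi), w in sub_edges.items()}

# Example: current visit {a -> b} over the Table 1 sessions (first-order edge counts).
edge_w = {('a', 'b'): 2, ('b', 'c'): 3, ('c', 'd'): 2, ('b', 'e'): 1, ('e', 'd'): 1,
          ('a', 'c'): 1, ('d', 'f'): 1, ('c', 'b'): 1, ('b', 'g'): 1, ('c', 'f'): 1, ('f', 'a'): 1}
print(build_prng(['a', 'b'], edge_w))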
Since the resulting prNG includes all possible “next” page visits
of the user, we then apply UPR in order to rank them and generate
personalized recommendations. The personalized navigational
sub-graph prNG should be built so as to retain the attributes
required for UPR to converge. The irreducibility of the sub-graph is
always satisfied, since we have added the damping factor (1-ε) in
the rank propagation. Moreover, the requirement that the sum of the
outgoing edge weights of every node in the sub-graph equals 1
(Equation 1) is satisfied, since we normalize them. Note here that
prNG does not include any previously visited pages.
Definition (l-UPR): We define l-UPR(p_i) of a page p_i as the UPR
rank value of this page in the personalized sub-graph prNG.
These l-UPR rankings of the candidate pages are subsequently
used to generate a personalized recommendation set to each user.
This process is explained in more detail in the following Section.
5.2.2 UPR-based Personalized Recommendations
The application of UPR or l-UPR to the navigational graph results
in a ranked set of pages which are then used for
recommendations. As already presented, the final set of candidate
recommendation pages can be either personalized or global,
depending on the combination of algorithm - navigational graph
chosen:
1) Apply l-UPR to prNG. Since prNG is a personalized fraction of the
NG synopsis, this approach results in a "personalized" usage-based
ranking of the pages most likely to be visited next, based on the
current user's path.

2) Apply UPR to the NG synopsis. This approach results in a "global"
usage-based ranking of all the web site's pages. This global ranking
can be used as an alternative in case the personalized ranking does
not generate any recommendations. It can also be used for assigning
page probabilities in the context of other probabilistic prediction
frameworks, as we will describe in the Section that follows.
Finally, another consideration would be to have a pre-computed
set of recommendations for all popular paths in the web site, in
order to save time in online computations of the final
recommendation set.
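As a small illustration of option 1 above, the following Python fragment ranks the candidate pages of a prNG by their UPR scores and returns the top-n; it assumes the upr() and build_prng() sketches given earlier in this text (both our own illustrations, not the paper's code).

def recommend(prng_edges, upr_scores, n=3):
    # prng_edges: dict (p_j, p_i) -> normalized edge weight of the personalized sub-graph.
    # upr_scores: dict page -> (l-)UPR rank value, e.g. the output of upr(prng_edges).
    candidates = {pi for (_, pi) in prng_edges}
    return sorted(candidates, key=lambda p: -upr_scores.get(p, 0.0))[:n]

For the running example of Table 1 and the current visit {a → b}, one would compute prng = build_prng(['a', 'b'], edge_w) and then call recommend(prng, upr(prng)).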
6. EXPERIMENTAL EVALUATION
As already mentioned, the UPR algorithm, as well as the process of
creating personalized sub-graph synopses (prNG) and applying l-UPR,
is orthogonal to the navigational graph synopsis we may choose to
model our data with. In this work, we choose the Markov Chain to
synopsize the Navigational Tree NtG. We
present here experimental results regarding the impact of
incorporating our proposed method in the Markov Chain
prediction model. More specifically, we select the top-n most
popular paths, as derived from the web logs. For these paths we
compute a recommendation set, using two variations of Markov
Chains and our algorithm applied on a Markov Chain synopsis.
We then compare the recommendation sets with the actual paths
followed by the users.
6.1 Experimental Setup
In our experiments we used two publicly available data sets. The
first one includes the page visits of users who visited the
“msnbc.com” web site on 28/9/99 [19]. The visits are recorded at
the level of URL category (for example sports, news, etc.). It
includes visits to 17 categories (i.e. 17 distinct pageviews). We
selected 96,000 distinct sessions including more than one and fewer
than 50 page visits per session and split them into two
non-overlapping time windows to form a training (65,000 sessions)
and a test (31,000 sessions) data set. The second data set includes
the sessionized data for the DePaul University CTI web server, based
on a random sample of users visiting the site during a two-week
period in April 2002 [6]. The data set includes 683 distinct
pageviews and 13,745 distinct user sessions of length more than one.
We split the sessions into two non-overlapping time windows to form
a training (9,745 sessions) and a test (4,000 sessions) data set. We
will refer to these data sets as the msnbc and cti data sets,
respectively. We chose to use these two data sets since they present
different characteristics in terms of web site context and number of
pageviews. (We should note at this point that there does not exist
any benchmark for web usage mining and personalization; we therefore
chose these two publicly available data sets, which have been used
in the past for experimentation in the web usage mining and
personalization context.) More specifically, msnbc includes the
visits to a very big portal, which means that both the number of
sessions and the length of the paths are very large. This data set
has, however, very few distinct pageviews, since the visits are
recorded at the level of page categories. We expect the visits to
this web site to be almost homogeneously distributed among the 17
different categories. On the other hand, the cti data set refers to
an academic web site. Visits to such sites usually fall into two
main groups: visits from students looking for information concerning
courses or administrative material, and visits from researchers
seeking information on papers, research projects, etc. We expect the
recorded visits to reflect this categorization.
Since in all the experiments we created top-n rankings, in the
evaluation step we used two metrics commonly used for
comparing two top-n rankings r_1 and r_2. The first one, denoted
OSim(r_1, r_2) [11], measures the degree of overlap between the
top-n elements of two sets A and B (each of size n):

OSim(r_1, r_2) = |A ∩ B| / n    (10)

The second, KSim(r_1, r_2), is based on Kendall's distance measure
[13] and indicates the degree to which the relative orderings of two
top-n lists are in agreement. It is defined as:

KSim(r_1, r_2) = |{(u, v) : r_1', r_2' have the same ordering of (u, v), u ≠ v}| / ( |A ∩ B| (|A ∩ B| − 1) )    (11)
where r_1' is an extension of r_1, containing all elements included
in r_2 but not in r_1, appended at the end of the list (r_2' is
defined analogously) [11]. In other words, KSim takes into
consideration only the common items of the two lists and computes
how many pairs of them have the same relative ordering in both
lists. It is obvious that OSim is more important (especially for
small rankings), since it indicates the concurrence of the predicted
pages with the actually visited ones. On the other hand, KSim must
always be evaluated in conjunction with the respective OSim, since
it can take high values even when only a few items are common to the
two lists.
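For concreteness, the two metrics can be computed as in the following Python sketch (our own illustration; the ranked lists are given in rank order):

def osim(r1, r2):
    # OSim (Eq. 10): overlap between two top-n rankings of equal size n.
    return len(set(r1) & set(r2)) / len(r1)

def ksim(r1, r2):
    # KSim (Eq. 11): fraction of ordered pairs of common items ranked in the same relative order.
    common = [x for x in r1 if x in r2]            # common items, in r1's order
    pos2 = {x: r2.index(x) for x in common}        # positions of the common items in r2
    if len(common) < 2:
        return 0.0
    agree = sum(1 for i in range(len(common)) for j in range(len(common))
                if i != j and (i < j) == (pos2[common[i]] < pos2[common[j]]))
    return agree / (len(common) * (len(common) - 1))

print(osim(['a', 'b', 'c'], ['b', 'c', 'd']), ksim(['a', 'b', 'c'], ['b', 'c', 'd']))  # 0.666..., 1.0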
6.2 Recommendations' Accuracy Evaluation
As already mentioned, the choice of the navigational graph synopsis
we use to model the user sessions is orthogonal to the l-UPR
framework. In this Section, we present results regarding the impact
of using our proposed method instead of pure usage-based
probabilistic models, focusing on Markov Chains.

We used 3 different setups for generating recommendations. The first
two, referred to as Start and Total, are the ones commonly used in
Markov models for computing prior probabilities. More specifically,
Total assigns prior page probabilities proportional to the total
page visits, whereas Start assigns prior page probabilities
proportional to the visits beginning with this page. The third
setup, referred to as l-Upr, is in essence our proposed algorithm
applied to a Markov Chain-based prNG. For the l-Upr setup, we set ε
to 0.15 (i.e., a damping factor 1 − ε of 0.85) and the number of
iterations to 100 to ensure convergence. We expand each path to
depth d = 2.

The experimental scenario is as follows: We build the navigational
graph from the test data set and select the 10 most popular paths
comprising two or more pages. For each such path p, we make the
assumption that it is the current path of the user and generate
recommendations by applying the aforementioned approaches to the
training data set. Using the first two setups, we find the n pages
having the highest probability of being visited after p. On the
other hand, using our approach, we expand p to create a localized
sub-graph and then apply l-UPR to rank the pages included in it. We
then select the top-n ranked pages. This process results in three
recommendation sets for each path p. At the same time, we identify,
in the test data set, the n most frequent paths that extend p by one
more page. We finally compare, for each path p, the generated top-n
page recommendations of each method (Start, Total, l-Upr) with the n
most frequent "next" pages, using the OSim and KSim metrics.

We run the experiments generating top-3 and top-5 recommendation
lists for each setup. We performed the experiments using small
recommendation sets because this more closely resembles what happens
in practice, i.e. the system recommends only a few "next" pages to
the user. The diagrams presented here show the average OSim and KSim
similarities over all 10 paths.

Figure 8. Average OSim and KSim of top-n rankings for msnbc data set
Figure 8 depicts the average OSim and KSim values for the top-3
and top-5 rankings generated for the msnbc data set. In the first
case (top-3 page predictions) we observe that l-Upr behaves
slightly worse in terms of prediction accuracy (OSim) but all
methods achieve around 50% accuracy. The opposite result is
observed in the second case (top-5 page predictions), where l-Upr
achieves better prediction accuracy than the other two methods, and
the overall prediction accuracy is above average. In both cases we
observe a lower KSim, indicating that l-Upr managed to predict the
"next" pages, but not in the same order as they were actually
visited. As we mentioned earlier, however, the presentation order is
not so important in such a small recommendation list. Overall, the
differences between the three methods are insignificant. This can be
justified if we take into account the nature of the data set used.
As already mentioned, the number of distinct pageviews in this data
set is very small, and therefore the predictions are likely to
coincide, irrespective of the method used.
Examining all the findings together, we verify our claim that l-UPR
performs as well as, or better than, commonly used probabilistic
prediction methods. Even though the prediction accuracy in both
experiments is around 50%, we should point out that this value
represents the average OSim over 10 distinct top-n rankings.
Examining the similarities individually, we observed a large
variance in the findings, with some recommendation sets being very
similar to the actually visited pages (OSim > 70%), whereas others
were very dissimilar (OSim < 20%). Moreover, the NG synopsis used in
all three setups is the Markov Chain, which is the simplest synopsis
model, yet the least accurate one. We expect better prediction
accuracy if the algorithm is applied over a more accurate NG
synopsis and leave this open for future work.
In order to conclude on whether the number of distinct pageviews is
what affects the prediction accuracy of the three methods,
we performed the same experimental evaluation on the second
data set, cti. Figure 9 depicts the average OSim and KSim values
for the top-3 and top-5 rankings generated for the cti data set. We
observe that in both cases l-Upr outperforms the other two
methods both in terms of prediction accuracy (OSim) and relative
ordering (KSim). This finding supports our intuition, that in the
case of big web sites that have many pageviews, the incorporation
of structure data in the prediction process enhances the accuracy
of the recommendations.
Figure 9. Average OSim and KSim of top-n rankings for cti data set

Overall, taking into consideration the low complexity of the
proposed algorithm, which enables the fast, online generation of
personalized recommendations, we conclude that it is a very
efficient alternative to pure usage-based methods.

7. CONCLUSIONS – FUTURE WORK
There exist many recommendation models used for personalizing
a web site based on previous users’ navigational behavior. Most
of the models, however, are solely based on usage data and do not
take into consideration other characteristics of web navigation,
such as the link structure of the web site. In this paper we propose a
novel algorithm, UPR, which can be applied to any navigational
graph synopsis in order to quickly provide ranked personalized
recommendations to the visitors of a web site. The experiments
we have performed are more than promising. Our future plans
involve the application of l-UPR on different NG synopses. As
shown in the experimental evaluation, l-UPR is a very promising
recommendation algorithm. In our study we applied it on the
Markov Chain NG synopsis. We expect better results in the case
of more complex NG synopses, which approximate more
accurately the navigational graph.
8. REFERENCES
[1] M.S. Aktas, M.A. Nacar, F. Menczer, Personalizing
PageRank Based on Domain Profiles, in Proc. of WEBKDD
2004 Workshop, Seattle, 2004.
[2] J. Borges, M. Levene, Data Mining of User Navigation
Patterns, in Revised Papers from the International Workshop
on Web Usage Analysis and User Profiling, LNCS Vol.
1836, pp.92-111, 2000
[3] S. Brin, L. Page, The anatomy of a large-scale hypertextual
Web search engine, Computer Networks, 30(1-7): 107-117,
1998
[4] I. Cadez, S. Gaffney, P. Smyth, A general probabilistic
framework for clustering individuals and objects, in Proc. of
ACM KDD2000 Conference, Boston, 2000
[5] I.Cadez, D.Heckerman, C.Meek, P. Smyth, S. White,
Visualization of Navigation Patterns on a Web Site Using
Model Based Clustering, in Proc. of ACM KDD2000
Conference, Boston MA, 2000
[6] CTI DePaul web server data,
http://maya.cs.depaul.edu/~classes/ect584/data/cti-data.zip
[7] M. Deshpande, G. Karypis, Selective Markov Models for
Predicting Web-Page Accesses, in Proc. of the 1st SIAM
International Conference on Data Mining, 2001
[8] M. Eirinaki, Web Mining: A Roadmap, Technical Report,
DB-NET 2004, available at http://www.db-net.aueb.gr
[9] M. Eirinaki, M. Vazirgiannis, Web Mining for Web
Personalization, in ACM Transactions on Internet
Technology (TOIT), 3(1):1-29, 2003
[10] M. Eirinaki, M. Vazirgiannis, I. Varlamis, SEWeP: Using
Site Semantics and a Taxonomy to Enhance the Web
Personalization Process, in Proc. of ACM KDD2003
Conference, Washington DC, 2003
[11] T. Haveliwala, Topic-Sensitive PageRank, in Proc. of
WWW2002 Conference, Hawaii, 2002
[12] Z. Huang, X. Li, H. Chen, Link Prediction Approach to
Collaborative Filtering, in Proc. of ACM JCDL’05, 2005
[13] M. Kendall, J.D.Gibbons, Rank Correlation Methods, Oxford
University Press, 1990
[14] S.D. Kamvar, T.H. Haveliwala, C.D. Manning, and G.H.
Golub, Extrapolation Methods for Accelerating PageRank
Computations, in Proc. of the 12th International World Wide
Web Conference, 2003
[15] S.D. Kamvar, T.H. Haveliwala, and G.H. Golub, Adaptive
Methods for the Computation of PageRank, in Proc. of the
International Conference on the Numerical Solution of
Markov Chains, 2003
[16] M. Levene, G. Loizou, Computing the Entropy of User
Navigation in the Web, in Intl. Journal of Information
Technology and Decision Making, 2:459-476, 2003
[17] E. Manavoglou, D. Pavlov, C.L. Giles, Probabilistic User
Behaviour Models, in Proc. of ICDM 2003
[18] R. Motwani and P. Raghavan. Randomized Algorithms,
Cambridge University Press, United Kingdom, 1995
[19] msnbc.com Web Log Data, available from UCI KDD
Archive,
http://kdd.ics.uci.edu/databases/msnbc/msnbc.html
[20] M. Nakagawa, B. Mobasher, A Hybrid Web Personalization
Model Based on Site Connectivity, in Proc. of the 5th
WEBKDD Workshop, Washington DC, 2003
[21] N. Polyzotis, M. Garofalakis, Structure and Value Synopses
for XML Data Graphs, in Proc. of the 28th VLDB
Conference, 2002
[22] N. Polyzotis, M. Garofalakis, Y. Ioannidis, Approximate
XML Query Answers, in Proc. of SIGMOD 2004, Paris,
France, 2004
[23] M. Richardson, P. Domingos, The Intelligent Surfer:
Probabilistic Combination of Link and Content Information
in PageRank, in Neural Information Processing Systems, 14,
pp.1441-1448, 2002
[24] R.R. Sarukkai, Link Prediction and Path Analysis Using
Markov Chains, in Computer Networks, 33(1-6): 337-386,
2000
[25] R. Sen, M. Hansen, Predicting a Web user’s next access
based on log data, in Journal of Computational Graphics and
Statistics, 12(1):143-155, 2003
[26] J. Zhu, J. Hong, J. G. Hughes, Using Markov Models
for Web Site Link Prediction, in Proc. of ACM HT’02,
Maryland, 2002