
UPR: Usage-based Page Ranking for Web Personalization

2006, … of the 5th IEEE International Conference …

Recommendation algorithms aim at proposing "next" pages to a user based on her navigational behavior. In the vast majority of related algorithms, only the usage data are used to produce recommendations. We claim that taking also into account the web structure and using link analysis algorithms ameliorates the quality of recommendations. In this paper we present UPR, a personalization algorithm which combines usage data and link analysis techniques for ranking and recommending web pages to the end user. Using the web site's structure and previously recorded user sessions we produce personalized navigational subgraphs (prNGs) to be used for applying UPR. Experimental results show that the accuracy of the generated recommendations is superior to pure usage-based approaches.

Magdalini Eirinaki, Michalis Vazirgiannis
Athens University of Economics and Business, Dept. of Informatics, Athens, Greece

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications - Data Mining; H.3.5 [Information Storage and Retrieval]: Online Information Services - Web-based services

General Terms
Algorithms, Experimentation.

Keywords
Web Personalization, Link Analysis, PageRank, Usage-based PageRank

1. INTRODUCTION
The evolution of the World Wide Web into the main information source for millions of people has created the need for new methods and algorithms that can efficiently process the vast amounts of data residing on it. Users are becoming more and more demanding in terms of the quality of information provided to them when searching the web or browsing a web site. The area of web mining, which includes any method that utilizes data residing on the web, namely usage, content and structure data, addresses this need.
The most common applications involve the ranking of the results of a web search engine and the provision of recommendations to users of (usually commercial) web sites, known as web personalization. PageRank is the most popular link analysis algorithm, used to rank the results returned by a search engine after a user query. The ranking is performed by evaluating the importance of a page in terms of its connectivity to and from other important pages. Many variations of this algorithm have been proposed in the past, aiming at refining the acquired results. Some of these approaches make use of the so-called “personalization vector” of PageRank in order to bias the results towards the individual needs of every user searching the web.

In this work, we introduce PageRank in a totally different context, that of web personalization. Web personalization is defined as any action that adapts the information or services provided by a Web site to the needs of a user or a set of users, taking advantage of the knowledge gained from the users’ navigational behavior and individual interests, in combination with the content and the structure of the Web site [10]. In the past, many approaches have been proposed, based on pure web usage mining algorithms, Markov models, or a combination of usage and content mining techniques. Motivated by the fact that, in the context of navigating a web site, a page or path is important if many users have visited it before, we propose a novel approach based on a personalized version of PageRank, applied to the navigational tree created by previous users’ navigations. We personalize PageRank by biasing it to “favor” pages and paths previously preferred by many users. We prove that this hybrid algorithm can be applied to any web site’s navigational graph as long as it satisfies certain properties.
Thus, it is orthogonal to any graph synopsis we may choose to model the user sessions with, such as a Markov Chain, higher-order Markov models, tree-like synopses, etc. This approach is therefore generic and proved to converge after a few iterations, thus providing fast results, while we can trade simplicity for accuracy by applying it to less or more complex web navigational graph models. More specifically, our key contributions are:
• UPR, a novel usage-based personalized PageRank-style algorithm used for ranking the web pages of a site based on previous users’ navigational behavior.
• A method for creating personalized navigational graph synopses (prNG) to be used for applying UPR.
• A personalization method which combines usage and structure data for ranking and recommending web pages to the end user.
• A set of experimental results which show that the incorporation of link analysis in the web personalization process improves the recommendations’ accuracy.

Note: A modified version of this paper has appeared in the Proceedings of the 5th IEEE International Conference on Data Mining (ICDM ’05).

The rest of this paper is organized as follows: In Section 2 we overview related work in the areas of PageRank-based algorithms, web usage mining and personalization. In Section 3 we overview the basic properties and functionality of the PageRank algorithm, whereas in Section 4 we present our novel algorithm, UPR, and prove its convergence. Section 5 describes the method of creating the Navigational Tree NtG and the localized (personalized) Navigational Graph synopses prNG in order to provide personalized recommendations by applying UPR. Section 6 includes an experimental evaluation of the proposed approach, whereas in Section 7 we conclude with insights into our plans for future work.

2. RELATED WORK
Although the connectivity features of the web graph have been extensively used for personalizing web search results [1, 11, 23], only a few approaches take them into consideration in the web site personalization process. Zhu et al. [26] use citation and coupling network analysis techniques in order to conceptually cluster the pages of a web site. The proposed recommendation system is based on Markov models. Nakagawa and Mobasher [20] use the degree of connectivity between the pages of a web site as the determinant factor for switching among recommendation models based on either frequent itemset mining or sequential pattern discovery. Nevertheless, none of the aforementioned approaches fully integrates link analysis techniques in the web personalization process by exploiting the notion of the authority or importance of a web page in the web graph.

In a very recent work, Huang et al. [12] address the data sparsity problem of collaborative filtering systems by creating a bipartite graph and calculating linkage measures between unconnected pairs in order to select candidates and make recommendations. In that study the graph nodes represent both users and rated/purchased items. Finally, subsequent to our work, Borges and Levene [BL06] independently proposed two link analysis ranking methods, SiteRank and PopularityRank, which are the PageRank algorithm and a personalized variation of it applied to a web site graph. Their work focuses on comparing the distributions and the rankings of the two methods rather than on proposing a web personalization algorithm. The authors’ concluding remark, that the topology of the web site is very important and should be taken into consideration in the web personalization process, further supports our claim. To the best of our knowledge, however, this is the first approach that uses a personalized usage-biased PageRank-like algorithm in the context of web personalization.
In our approach, we propose the use of web navigational graph synopses, such as Markov models. In the past, many researchers have proposed the use of first-order (Markov Chains) [2, 5, 24], higher-order [16], or hybrid [4, 7, 17, 25] Markov models. Even though Markov Chains provide a simple way to capture sequential dependence, they do not take into consideration the “long-term memory” aspects of web surfing behavior. Higher-order Markov models are more accurate for predicting navigational paths; there exists, however, a trade-off between improved coverage and the exponential increase in state-space complexity as the order increases. Moreover, such complex models often require inordinate amounts of training data, and hybrid models, which combine Markov models of different orders, require many more resources in terms of preprocessing and training. We should stress at this point that our approach is orthogonal to the type of synopsis one may choose. This synopsis may be a Markov model of any order (depending on the simplicity and accuracy required), or any other graph synopsis, such as those proposed in [21, 22].

Apart from Markov models, there also exist many approaches that perform web usage mining for web personalization, based on association rule mining, clustering, sequential pattern discovery, frequent pattern discovery or collaborative filtering. Since this is out of the scope of this paper, we refer the reader to [8, 9] for an extensive overview of such approaches.

3. PRELIMINARIES & REVIEW OF PAGERANK
The PageRank algorithm [3] is the most popular link analysis algorithm, used broadly for assigning numerical weightings to web documents and used by web search engines to rank the retrieved results. The algorithm models the behavior of a random surfer, who either chooses an outgoing link from the current page, or “jumps” to a random page after a few clicks.
The PageRank of a page is defined as the probability of the random surfer being at this page at some particular time step k > K. This probability is correlated with the importance of the page, as it is defined based on the number and the importance of the pages pointing to it. For sufficiently large K this probability is unique, as illustrated in what follows.

Consider the web as a directed graph G, where the N nodes represent the web pages and the edges the links between them. The random walk on G induces a Markov Chain where the states are given by the nodes in G, and M is the stochastic transition matrix with $m_{ij}$ describing the one-step transition from page j to page i. The adjacency function $m_{ij}$ is 0 if there is no direct link from $p_j$ to $p_i$, and is normalized such that, for each j:

$$\sum_{i=1}^{N} m_{ij} = 1 \tag{1}$$

As stated by the Perron-Frobenius theorem, if M is irreducible (i.e. G is strongly connected) and aperiodic, then $M^k$ (i.e. the transition matrix for the k-step transition) converges to a matrix in which each column is the unique stationary distribution $PR^*$, independent of the initial distribution PR. The stationary distribution is the vector which satisfies the equation:

$$PR^* = M \times PR^* \tag{2}$$

In other words, $PR^*$ is the dominant eigenvector of the matrix M. Since M is the stochastic transition matrix over the web graph G, PageRank is in essence the stationary probability distribution over pages induced by a random walk on the web.

As already implied, the convergence of PageRank is guaranteed only if M is irreducible and aperiodic [18]. The aperiodicity constraint is guaranteed in practice in the web context, whereas irreducibility is satisfied by adding a damping factor (1-ε) to the rank propagation (ε is a very small number, usually set to 0.15), in order to limit the effect of rank sinks and guarantee convergence to a unique vector.
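As a small illustration of the convergence property behind Equation 2, the following sketch repeatedly applies a column-stochastic matrix to two different starting distributions. The 3-page graph, the function name `power_iteration` and the number of steps are assumptions made for this example only; they do not come from the paper.

```python
def power_iteration(M, start, steps=100):
    """Apply the column-stochastic matrix M to `start` for `steps` rounds."""
    n = len(M)
    pr = list(start)
    for _ in range(steps):
        # One step of the random walk: pr_new[i] = sum_j M[i][j] * pr[j]
        pr = [sum(M[i][j] * pr[j] for j in range(n)) for i in range(n)]
    return pr


# Hypothetical 3-page graph; M[i][j] is the probability of moving from page j to page i.
# Links: 0 -> {1, 2}, 1 -> {0, 2}, 2 -> {0}; every column sums to 1 (Equation 1).
M = [
    [0.0, 0.5, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0],
]

from_uniform = power_iteration(M, [1 / 3, 1 / 3, 1 / 3])
from_page_0 = power_iteration(M, [1.0, 0.0, 0.0])
print(from_uniform)
print(from_page_0)
```

This small graph is strongly connected and aperiodic, so both starting vectors should end up at approximately (4/9, 2/9, 1/3), the vector that satisfies Equation 2 for this matrix.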
To guarantee irreducibility, we therefore define a new matrix M′ by adding low-probability transition edges between all nodes in G (we should point out that, in order to ensure that M′ is irreducible, we should remove any dangling nodes, i.e. nodes with outdegree 0):

$$M' = (1 - \varepsilon) M + \varepsilon u \tag{3}$$

In other words, the user may choose a random destination based on the probability distribution of u. This process is also known as teleportation. PageRank can then be expressed as the unique solution to Equation 2, if we substitute M with M′:

$$PR = (1 - \varepsilon) M \times PR + \varepsilon p \tag{4}$$

where p is a non-negative N-vector whose elements sum to 1. Usually $m_{ij} = 1 / |Out(p_j)|$ and $u = [1/N]_{N \times N}$, i.e. the probability of randomly jumping to another page is uniform. In that case, $p = [1/N]_{N \times 1}$. By choosing, however, u, and consequently p, to follow a non-uniform distribution, we essentially bias the resultant PageRank vector computation to favor certain pages. Thus, p is usually referred to as the personalization vector. This approach is broadly used in the web search engines’ context, where the ranking of the retrieved results is biased by favoring pages relevant to the query terms, or to the user’s preferences for certain topic categories [1, 11, 23].

4. USAGE-BASED PAGERANK
So far, PageRank has been used in the context of web search, either in its original form (assuming the user follows one of the outgoing links with equal probabilities, or uniformly jumps to a random page) or in its personalized form (changing the personalization vector to favor certain pages). In our approach, we use this algorithm in a totally different context, that of web personalization. We introduce a hybrid PageRank-style algorithm for ranking the pages of a web site in order to provide recommendations to each visitor. For this reason, we bias the computation using the knowledge acquired from previous users’ visits, as discovered from the user sessions recorded in the web site’s logs. To achieve this, we define both the transition matrix M and the personalization vector p in such a way that pages and paths previously visited by other users are preferred.

In what follows, we present a usage-biased version of the PageRank algorithm, used for ranking the pages of a web site based on the navigational behavior of previous visitors. We consider the directed navigational graph NG, where the nodes represent the web pages of the web site and the edges represent the consecutive one-step paths followed by previous users. Both nodes and edges carry weights. The weight $w_i$ on each node represents the number of times page $p_i$ was visited and the weight $w_{j \to i}$ on each edge represents the number of times $p_i$ was visited right after $p_j$. Following the aforementioned properties of Markov theory and the PageRank computation, the usage-based PageRank vector UPR is the solution to the following equation:

$$UPR = (1 - \varepsilon) M \times UPR + \varepsilon p \tag{5}$$

The transition matrix M on NG is defined as the square N×N matrix whose elements $m_{ij}$ equal 0 if there does not exist a link (i.e. visit) from page $p_j$ to $p_i$, and

$$m_{ij} = \frac{w_{j \to i}}{\sum_{p_k \in Out(p_j)} w_{j \to k}} \tag{6}$$

otherwise. The personalization vector p is defined as

$$p = \left[ \frac{w_i}{\sum_{p_j \in WS} w_j} \right]_{N \times 1} \tag{7}$$

Using the aforementioned formulas, we bias the PageRank calculation to rank higher the pages that were visited more often by previous users. We then use this hybrid ranking, combining structure and usage data of the site, to provide a ranked recommendation set to current users. Note that Equation 1 holds, that is, M is normalized such that each of its columns sums to 1; therefore M is a stochastic transition matrix, as required for the convergence condition of the algorithm to hold. As already mentioned, M is aperiodic in the web context, and irreducible, since we have included the damping factor (1-ε). We also eliminate any dangling pages by adding links to all other pages, with uniform probabilities, for all nodes with no outlinks. It is therefore guaranteed that Equation 5 will converge to a unique vector UPR*.

Definition (UPR): We define the usage-based PageRank (UPR) of a web page $p_i$ as the n-th iteration of the following recursive formula, which is Equation 5 written per page:

$$UPR^{n}(p_i) = (1 - \varepsilon) \sum_{p_j \in In(p_i)} \left( UPR^{n-1}(p_j) \times \frac{w_{j \to i}}{\sum_{p_k \in Out(p_j)} w_{j \to k}} \right) + \varepsilon \frac{w_i}{\sum_{p_j \in WS} w_j} \tag{8}$$

Each iteration of UPR has complexity $O(N^2)$, where N is the number of pages. The total complexity is thus determined by the number of iterations, which in turn depends on the size of the dataset. In practice, however, PageRank (and accordingly UPR) gives good approximations after 50 iterations for a damping factor 1-ε = 0.85 (i.e. ε = 0.15, the most commonly used value, recommended in [3]). The computations can be accelerated by applying techniques such as those described in [14, 15], even though this is not necessary in the proposed framework: since UPR is applied to a single web site, it converges after a few iterations.

5. UPR RECOMMENDATIONS
As already presented, UPR is a PageRank-style algorithm, based on the usage data collected from previous users’ sessions. UPR is applied to NG, a directed, weighted, strongly connected graph, which represents the navigational graph, or a synopsis of it. In this section, we present the process for creating the navigational tree graph NtG from the web sessions, and how we can subsequently create NG synopses from this tree. These synopses (which can be, for example, Markov models of any order) are in turn used to construct the personalized sub-graph prNG. Moreover, we present l-UPR, a localized version of UPR, which is applied to this personalized subset of the navigational graph.
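Since both UPR and l-UPR rest on the recursion of Equation 8, it may help to see that recursion as code. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `upr`, the dictionary-based graph representation and the starting vector are choices made for this example; ε weights the personalization term as in Equation 5; and dangling pages are handled by one reading of the text above (their rank is spread uniformly over all other pages). The weights are derived from the five sessions of Table 1, with single-character page labels.

```python
from collections import Counter


def upr(edge_w, node_w, eps=0.15, iterations=50):
    """Iterate UPR = (1 - eps) * M * UPR + eps * p on a weighted navigational graph.

    edge_w maps (j, i) to the number of times page i was visited right after page j;
    node_w maps each page to its total number of visits.
    """
    pages = list(node_w)
    total_visits = sum(node_w.values())
    p = {i: node_w[i] / total_visits for i in pages}  # personalization vector
    out_sum = Counter()
    for (j, _), w in edge_w.items():
        out_sum[j] += w
    rank = dict(p)  # assumption: start from p; any distribution would do
    for _ in range(iterations):
        nxt = {i: eps * p[i] for i in pages}
        for (j, i), w in edge_w.items():
            # Rank propagated along the edge j -> i, weighted by usage.
            nxt[i] += (1 - eps) * rank[j] * w / out_sum[j]
        # Assumption: a page without outlinks spreads its rank uniformly over all other pages.
        for j in pages:
            if out_sum[j] == 0 and len(pages) > 1:
                share = (1 - eps) * rank[j] / (len(pages) - 1)
                for i in pages:
                    if i != j:
                        nxt[i] += share
        rank = nxt
    return rank


sessions = ["abcd", "abed", "acdf", "bcbg", "bcfa"]  # the paths of Table 1
node_w = Counter(page for s in sessions for page in s)
edge_w = Counter(pair for s in sessions for pair in zip(s, s[1:]))
ranking = upr(edge_w, node_w)
for page, score in sorted(ranking.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

The same function can be applied unchanged to a smaller weighted sub-graph, which is how a localized ranking would be obtained in this sketch.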
5.1 Navigational Graph Construction
5.1.1 Navigational Tree (NtG)
The basis for the creation of prNG is a tree-like weighted graph, further referred to as the Navigational Tree (NtG), which represents all the distinct paths followed by users and recorded in the web logs. This structure has as its root the special node R, and all the other nodes are instances of the M web pages of WS. The weighted paths from the root towards the leaves represent all user sessions’ paths included in the web logs. All tree branches terminate in a special leaf node E denoting the end of a path. Note that in this representation there is replication of states in different parts of the tree-like structure, and the structure fully describes the information residing in the web logs.

We assume that the log files have been preprocessed and separated into distinct user sessions US, of length L. The algorithm for creating the NtG is depicted in Figure 1. Briefly, for every user session in the web logs, we create a path starting from the root of the tree. If a subsequence of the session already exists, we update the weights of the respective edges; otherwise we create a new branch, starting from the last common page visited in the path. We assume that any consecutive repetitions of a page have been removed from the user sessions; on the other hand, we keep any pages that have been visited more than once, but not consecutively. We also denote the end of a session using a special exit node.

Procedure CreateTree(U)
Input: User Sessions U
Output: Navigational Tree NtG
  root <- rootNode(NtG);
  tmpP <- root;
  for every US ∈ U do
    while US ≠ ∅ do
      si = first_state(US);
      if parent(tmpP, si) then
        w(tmpP, si) = w(tmpP, si) + 1;
      else
        addchild(tmpP, si);
        w(tmpP, si) = 1;
      endif
      tmpP <- si;
      US <- remove(US, si);
    done
    if parent(tmpP, E) then
      w(tmpP, E) = w(tmpP, E) + 1;
    else
      addchild(tmpP, E);
      w(tmpP, E) = 1;
    endif
    tmpP <- root;
  done
Figure 1. NtG Creation Algorithm

In order to make this process clearer, we present a simple example. Assume that the user sessions of a web site are those included in Table 1. The navigational tree created after applying the aforementioned algorithm is depicted in Figure 2. NtG is used for creating any desired navigational graph synopsis NG, which is in turn used for producing the localized prNG, as we will describe in the paragraphs that follow.

Table 1. User Sessions
User Session #    Path
1                 a → b → c → d
2                 a → b → e → d
3                 a → c → d → f
4                 b → c → b → g
5                 b → c → f → a

Figure 2. Tree-like Navigational Graph NtG

5.1.2 NG synopses
The tree structure created using the aforementioned algorithm has a root node and weighted edges connecting the rest of the tree nodes. Each web page in WS may have multiple occurrences as a node in the tree, because a page can be visited following different paths. Our approach’s objective is to apply a localized version of UPR to a fraction of NtG, based on the visits of each incoming user. Performing this on NtG directly, however, would be computationally very expensive and would require more time than is allowed for an online process. Therefore, we create a synopsis of NtG, which is subsequently used for extracting the localized prNG. The more detailed the synopsis is, the more accurately it represents NtG. On the other hand, the construction of a less detailed synopsis saves time and computational power. The simplest synopsis of NtG is a Markov Chain.
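The tree construction of Figure 1 and its collapse into the simplest synopsis can be sketched as follows. This is an illustrative reading rather than the authors' code: the `Node` class, the function names and the choice to store each edge weight on the child node are assumptions of this example, and page labels are assumed to be single characters so that a session can be written as a string. The Markov Chain step sums the weights of all tree edges that connect the same pair of pages.

```python
class Node:
    """One occurrence of a page in the navigational tree."""

    def __init__(self, page):
        self.page = page
        self.weight = 0      # weight of the edge from the parent to this node
        self.children = {}   # page label -> Node


def create_tree(sessions):
    """Build the navigational tree: one root-to-E path per distinct session."""
    root = Node("R")
    for session in sessions:
        node = root
        # Every path ends in the special exit node E.
        for page in list(session) + ["E"]:
            child = node.children.get(page)
            if child is None:
                child = node.children[page] = Node(page)
            child.weight += 1
            node = child
    return root


def markov_chain_edges(root):
    """Collapse the tree: sum the weights of all tree edges between the same two pages."""
    edges = {}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in node.children.values():
            key = (node.page, child.page)
            edges[key] = edges.get(key, 0) + child.weight
            stack.append(child)
    return edges


sessions = ["abcd", "abed", "acdf", "bcbg", "bcfa"]  # the paths of Table 1
tree = create_tree(sessions)
edges = markov_chain_edges(tree)
print(tree.children["a"].weight)  # number of sessions that start at page a
print(edges[("b", "c")])          # b -> c transitions summed over all branches
```

For the sessions of Table 1, three sessions start at page a, and the b → c transition occurs three times in total (once below a and twice directly below the root).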
In a Markov Chain synopsis, the edge weight $w_{i \to j}$ is computed as the sum of the weights of all directed edges between nodes $p_i$ and $p_j$, and the node weight $w_i$ is computed as the sum of all the weights of edges pointing to the multiple occurrences of $p_i$:

$$w_i = \sum_{k \in In(p_i)} w_{k \to i} \tag{9}$$

This representation is simple to construct; it depends, however, on the assumption that navigation is “memoryless”, in other words that the next page to be visited by a user depends only on the current page. NG synopses that take into consideration the “long-term memory” aspects of web surfing are higher-order Markov models, which can easily be constructed from NtG by computing the k-step path frequencies (where k is the order of the model). In Figure 3 we present the NG synopsis of the NtG of Figure 2 if we choose to model the synopsis as a Markov Chain. The numbers in parentheses in the nodes denote the number of times each page was visited, whereas the edge weights denote the number of times the respective path was followed. Nodes S and E represent the start and end points respectively. These special nodes are not used when applying UPR on the graph.

Figure 3. NG synopsis (Markov Chain)

5.2 Localized UPR (l-UPR)
In the previous section we presented the UPR algorithm, which can be applied to a web site in order to rank its web pages based both on its link structure and on the paths followed by previous users. This process provides us with a “global” ranking of the web site’s pages. In the context of web site personalization, however, we want to “bias” this algorithm further, focusing on the path the current visitor has followed, i.e. generating a “localized” personalized ranking. Using the NG synopsis we have modeled the user sessions with, we select a small subset of it based on the current user’s path. This sub-graph includes all the subsequent (to the current visit) pages visited by previous users up to a predefined path depth d. Therefore, it includes all the potential “next” pages of the current user’s visit. l-UPR (localized UPR) is in essence the application of UPR to this small, personalized fraction of the navigational graph. The resulting ranking is used to provide recommendations to the current visitor. This approach is much faster than applying UPR to the whole NG synopsis, since the size of the graph is dramatically reduced, thereby enabling online computations. Moreover, the ranking results are personalized for each individual user, since they are based on the current user’s visit and on similar users’ behavior in the past. We present the process of creating the personalized sub-graph prNG and the recommendation process in more detail below.

5.2.1 The Personalized Navigational Graph (prNG)
In short, the process of constructing the personalized sub-graph is as follows: We expand (part of) the path already visited by the user, including all the outgoing links (i.e. the pages and the respective weighted edges) existing in the NG synopsis. The length of the path taken into consideration when expanding the graph depends on the NG synopsis we have used (in the case of Markov model synopses this represents the desired “memory” of the system). We subsequently perform this operation for the new pages (or paths), until we reach a predefined expansion depth. We then remove any pages that have already been visited by the user, since these do not need to be included in the generated recommendations. Their subsequent paths, however, are linked to the rest of the sub-graph (consider this operation as removing a node from a linked list). This ensures that all the pages previously visited by users with similar behavior are kept in the final sub-graph, without including any higher-level pages they might have used as hubs for their navigation. After reaching the final set of nodes, we normalize each node’s outgoing edge weights.

Before proceeding with the technical details of this algorithm, we illustrate its functionality using two examples, based on the sessions included in Table 1. In both examples we create the prNGs for two user visits including the paths {a → b} and {b → c}. In the first example, we assume that the sessions are modeled using a Markov Chain NG synopsis. Using the path frequencies for the one-step transitions, we expand the two paths, {a → b} and {b → c}, to create the respective prNGs, as shown in Figure 4. The second example is based on a 2nd-order Markov model NG synopsis. Note that in this case we also use the path frequencies of two-step transitions. The corresponding prNGs for the two paths are illustrated in Figure 5. The outgoing edge weights of each node are normalized so that they sum to 1. We also observe that the nodes included in each prNG depend on the NG synopsis we choose to model the user sessions with.

Figure 4. prNG of Markov Chain NG synopsis
Figure 5. prNG of 2nd order Markov model NG synopsis

The prNG construction algorithm is presented in Figures 6 and 7. The algorithm complexity depends on the synopsis used, since the choice of the synopsis affects the time needed for locating the successive pages for expanding the current path. It also depends on the number of outgoing links of each sub-graph’s page and on the expansion depth d. Therefore, if the complexity of locating successive pages in a synopsis is k, the complexity of the prNG creation algorithm is $O(k \cdot fanout(NG)^d)$, where fanout(NG) is the maximum number of a node’s outgoing links in NtG. In the case of Markov model synopses, k = 1, since the process of locating the outgoing pages of a page or path reduces to a lookup in a hash table.

Procedure Create_prNG(CV, NG)
Input: Current User Visit CV, Navigational Graph NG
Output: Subset of NG, prNG
  start
  CV = {vp};
  cp = lastVisitedPath(CV);
  expand(cp, NG, depth, expNG);
  removeVisited(expNG, CV);
  updateEdges(expNG);
  prNG = normalize(expNG);
  end
Figure 6. prNG Generation Algorithm

Procedure expand(cp, NG, d, eNG)
Input: last page/path visited cp, navigational graph synopsis NG, depth of expansion d
Output: expanded navigational graph eNG
  start
  P := cp;
  R := rootNode(eNG);
  tempd = 0;
  addNode(eNG, R, cp);
  while (tempd <= d) do
    for every (p ∈ P of same level) do
      for every np = linksto(NG, p, np, w) do
        addNode(eNG, p, np, w);
        P += np;
      done;
    done;
    tempd += 1;
  done;
  end
Figure 7. Path expansion sub-routine

Since the resulting prNG includes all possible “next” page visits of the user, we then apply UPR in order to rank them and generate personalized recommendations. The personalized navigational sub-graph prNG should be built so as to retain the attributes required for UPR to converge. The irreducibility of the sub-graph is always satisfied, since we have added the damping factor (1-ε) in the rank propagation. Moreover, Equation 1, which states that the sum of all outgoing edges’ weights of every node in the sub-graph equals 1, is satisfied since we normalize them. Note here that prNG does not include any previously visited pages.

Definition (l-UPR): We define l-UPR($x_i$) of a page $x_i$ as the UPR rank value of this page in the personalized sub-graph prNG.

These l-UPR rankings of the candidate pages are subsequently used to generate a personalized recommendation set for each user. This process is explained in more detail in the following Section.

5.2.2 UPR-based Personalized Recommendations
The application of UPR or l-UPR to the navigational graph results in a ranked set of pages which are then used for recommendations. As already presented, the final set of candidate recommendation pages can be either personalized or global, depending on the combination of algorithm and navigational graph chosen:
1) Apply l-UPR to prNG.
Since prNG is a personalized fraction of the NG synopsis, this approach results in a “personalized” usage-based ranking of the pages most likely to be visited next, based on the current user’s path.
2) Apply UPR to NG synopsis. This approach results in a “global” usage-based ranking of all the web site’s pages. This global ranking can be used as an alternative in case personalized ranking does not generate any recommendations. It can also be used for assigning page probabilities in the context of other probabilistic prediction frameworks, as we will describe in the Section that follows.

Finally, another consideration would be to pre-compute a set of recommendations for all popular paths in the web site, in order to save time in the online computation of the final recommendation set.

6. EXPERIMENTAL EVALUATION
As already mentioned, the UPR algorithm, as well as the process of creating personalized sub-graph synopses (prNG) and applying l-UPR, is orthogonal to the navigational graph synopsis we may choose to model our data with. In this work, we choose the Markov Chain to synopsize the Navigational Tree NtG. We present here experimental results regarding the impact of incorporating our proposed method in the Markov Chain prediction model. More specifically, we select the top-n most popular paths, as derived from the web logs. For these paths we compute a recommendation set, using two variations of Markov Chains and our algorithm applied on a Markov Chain synopsis. We then compare the recommendation sets with the actual paths followed by the users.

6.1 Experimental Setup
In our experiments we used two publicly available data sets. The first one includes the page visits of users who visited the “msnbc.com” web site on 28/9/99 [19]. The visits are recorded at the level of URL category (for example sports, news, etc.). It includes visits to 17 categories (i.e. 17 distinct pageviews).
We selected 96,000 distinct sessions including more than one and less than 50 page visits per session and split them in two non-overlapping time windows to form a training (65,000 sessions) and a test (31,000 sessions) data set. The second data set includes the sessionized data for the DePaul University CTI web server, based on a random sample of users visiting the site for a two-week period during April 2002 [6]. The data set includes 683 distinct pageviews and 13,745 distinct user sessions of length more than one. We split the sessions in two non-overlapping time windows to form a training (9,745 sessions) and a test (4,000 sessions) data set. We will refer to these data sets as the msnbc and cti data sets respectively.

We chose to use these two data sets since they present different characteristics in terms of web site context and number of pageviews (see footnote 2). More specifically, msnbc includes the visits to a very big portal. That means that the number of sessions, as well as the length of paths, is very large. This data set, however, has very few pageviews, since the visits are recorded at the level of page categories. We expect that the visits to this web site are almost homogeneously distributed among the 17 different categories. On the other hand, the cti data set refers to an academic web site. Visits to such sites usually fall into two main groups: visits from students looking for information concerning courses or administrative material, and visits from researchers seeking information on papers, research projects, etc. We expect that the recorded visits will reflect this categorization.

Since in all the experiments we created top-n rankings, in the evaluation step we used two metrics commonly used for comparing two top-n rankings r1 and r2.
The first one, denoted OSim(r1, r2) [11], indicates the degree of overlap between the top-n elements of the two rankings. With A and B denoting the sets of top-n elements of r1 and r2 (each one of size n), it is defined as:

$$\mathrm{OSim}(r_1, r_2) = \frac{|A \cap B|}{n} \tag{10}$$

The second, KSim(r1, r2), is based on Kendall’s distance measure [13] and indicates the degree to which the relative orderings of two top-n lists are in agreement. It is defined as:

$$\mathrm{KSim}(r_1, r_2) = \frac{\left|\{(u, v) : r_1', r_2' \text{ have same ordering of } (u, v),\ u \neq v\}\right|}{|A \cap B|\,(|A \cap B| - 1)} \tag{11}$$

where r1’ is an extension of r1, containing all elements included in r2 but not r1 at the end of the list (r2’ is defined analogously) [11]. In other words, KSim takes into consideration only the common items of the two lists, and computes how many pairs of them have the same relative ordering in both lists. OSim is the more important metric (especially in small rankings), since it indicates how many of the predicted pages coincide with the ones actually visited. KSim, on the other hand, must always be evaluated in conjunction with the respective OSim, since it can take high values even when only a few items are common to the two lists.

6.2 Recommendations’ Accuracy Evaluation

As already mentioned, the choice of the navigational graph synopsis we use to model the user sessions is orthogonal to the l-UPR framework. In this Section, we present results regarding the impact of using our proposed method instead of pure usage-based probabilistic models, focusing on Markov Chains.

We used three different setups for generating recommendations. The first two, referred to as Start and Total, are the ones commonly used in Markov models for computing prior probabilities. More specifically, Total assigns prior page probabilities proportional to the total page visits, whereas Start assigns prior page probabilities proportional to the visits beginning with this page. The third setup, referred to as l-Upr, is in essence our proposed algorithm applied to a Markov Chain-based prNG. For the l-Upr setup, we set the damping factor (1-ε) to 0.15 and the number of iterations to 100 to ensure convergence. We expand each path to depth d=2.

The experimental scenario is as follows. We build the navigational graph from the test data set and select the 10 most popular paths comprising two or more pages. For each such path p, we assume that it is the current path of the user and generate recommendations by applying the aforementioned approaches on the training data set. Using the first two setups, we find the n pages with the highest probability of being visited after p. Using our approach, on the other hand, we expand p to create a localized sub-graph and then apply l-UPR to rank the pages included in it. We then select the top-n ranked pages. This process results in three recommendation sets for each path p. At the same time, we identify, in the test data set, the n most frequent paths that extend p by one more page. Finally, we compare, for each path p, the generated top-n page recommendations of each method (Start, Total, l-Upr) with the n most frequent “next” pages, using the OSim and KSim metrics.

We ran the experiments generating top-3 and top-5 recommendation lists for each setup. We used small recommendation sets because this more closely resembles what happens in practice, i.e. the system recommends only a few “next” pages to the user. The diagrams presented here show the average OSim and KSim similarities over all 10 paths.

² We should note at this point that there does not exist any benchmark for web usage mining and personalization. We therefore chose these two publicly available data sets, which have been used in the past for experimentation in the web usage mining and personalization context.

[Bar charts: msnbc data set, top-3 recommendations and top-5 recommendations; series: Start, Total, l-Upr; y-axis: average similarity (0 to 1); x-axis: OSim, KSim]

Figure 8.
Average OSim and KSim of top-n rankings for msnbc data set

Figure 8 depicts the average OSim and KSim values for the top-3 and top-5 rankings generated for the msnbc data set. In the first case (top-3 page predictions) we observe that l-Upr behaves slightly worse in terms of prediction accuracy (OSim), but all methods achieve around 50% accuracy. The opposite result is observed in the second case (top-5 page predictions), where l-Upr achieves better prediction accuracy than the other two methods, and the overall prediction accuracy is above average. In both cases we observe a lower KSim, which indicates that l-Upr managed to predict the “next” pages but not in the same order as they were actually visited. As we mentioned earlier, however, the presentation order is not so important in such a small recommendation list. Overall, the differences between the three methods are insignificant. This can be explained by the nature of the data set used. As already mentioned, the number of distinct pageviews of the data set is very small, and therefore the probability that the predictions coincide is the same, irrespective of the method used.

Examining all findings in total, we verify our claim that l-UPR performs the same as, or better than, commonly used probabilistic prediction methods. Even though the prediction accuracy in both experiments is around 50%, we should point out that this value represents the average OSim over 10 distinct top-n rankings. Examining the similarities individually, we observed a large variance in the findings, with some recommendation sets being very similar to the actually visited pages (OSim > 70%), whereas others were very dissimilar (OSim < 20%). Moreover, the NG synopsis used in all three setups is the Markov Chain, which is the simplest synopsis model, yet the least accurate one. We expect better prediction accuracy if the algorithm is applied over a more accurate NG synopsis and leave this open for future work.
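The two evaluation metrics can be illustrated with a short sketch. It is not the authors' implementation: the function names `osim` and `ksim` are hypothetical, rankings are assumed to be Python lists without duplicates, and KSim follows the reading given in the text, where only the items common to both lists are compared. The text does not define KSim when fewer than two items are shared; returning 0.0 in that case is an assumption of this sketch.

```python
from itertools import permutations


def osim(r1, r2):
    """Overlap of two top-n rankings of equal size n: |A ∩ B| / n (Eq. 10)."""
    if len(r1) != len(r2) or not r1:
        raise ValueError("rankings must be non-empty and of equal length")
    return len(set(r1) & set(r2)) / len(r1)


def ksim(r1, r2):
    """Fraction of ordered pairs of common items that both rankings
    order the same way, normalised by |A ∩ B| (|A ∩ B| - 1) (Eq. 11)."""
    common = set(r1) & set(r2)
    k = len(common)
    if k < 2:
        # Assumption: undefined in the text, treated here as no agreement.
        return 0.0
    pos1 = {page: i for i, page in enumerate(r1)}
    pos2 = {page: i for i, page in enumerate(r2)}
    agree = sum(
        1
        for u, v in permutations(common, 2)
        if (pos1[u] < pos1[v]) == (pos2[u] < pos2[v])
    )
    return agree / (k * (k - 1))


# Hypothetical top-3 lists: recommended pages vs. most frequent "next" pages.
predicted = ["news", "sports", "tech"]
actual = ["sports", "news", "weather"]
print(osim(predicted, actual), ksim(predicted, actual))
```

In this toy example two of the three recommended pages are among the actual “next” pages, but the two common pages appear in opposite order, so OSim is 2/3 while KSim is 0. This is the kind of situation discussed above, where the pages are predicted correctly but not in the order in which they were visited.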
In order to determine whether the number of distinct pageviews is what affects the prediction accuracy of the three methods, we performed the same experimental evaluation on the second data set, cti. Figure 9 depicts the average OSim and KSim values for the top-3 and top-5 rankings generated for the cti data set. We observe that in both cases l-Upr outperforms the other two methods both in terms of prediction accuracy (OSim) and relative ordering (KSim). This finding supports our intuition that, in the case of big web sites that have many pageviews, the incorporation of structure data in the prediction process enhances the accuracy of the recommendations.

[Bar charts: cti data set, top-3 recommendations and top-5 recommendations; series: Start, Total, l-Upr; y-axis: average similarity; x-axis: OSim, KSim]

Figure 9. Average OSim and KSim of top-n rankings for cti data set

Overall, taking into consideration the low complexity of the proposed algorithm, which enables the fast, online generation of personalized recommendations, we conclude that it is a very efficient alternative to pure usage-based methods.

7. CONCLUSIONS – FUTURE WORK

There exist many recommendation models used for personalizing a web site based on previous users’ navigational behavior. Most of the models, however, are solely based on usage data and do not take into consideration other characteristics of the web navigation, such as the link structure of a web site. In this paper we propose a novel algorithm, UPR, which can be applied to any navigational graph synopsis in order to quickly provide ranked personalized recommendations to the visitors of a web site. The experiments we have performed are more than promising.

Our future plans involve the application of l-UPR on different NG synopses. As shown in the experimental evaluation, l-UPR is a very promising recommendation algorithm. In our study we applied it on the Markov Chain NG synopsis. We expect better results in the case of more complex NG synopses, which approximate the navigational graph more accurately.

8. REFERENCES

[1] M.S. Aktas, M.A. Nacar, F. Menczer, Personalizing PageRank Based on Domain Profiles, in Proc. of WEBKDD 2004 Workshop, Seattle, 2004
[2] J. Borges, M. Levene, Data Mining of User Navigation Patterns, in Revised Papers from the International Workshop on Web Usage Analysis and User Profiling, LNCS Vol. 1836, pp. 92-111, 2000
[3] S. Brin, L. Page, The anatomy of a large-scale hypertextual Web search engine, Computer Networks, 30(1-7): 107-117, 1998
[4] I. Cadez, S. Gaffney, P. Smyth, A general probabilistic framework for clustering individuals and objects, in Proc. of ACM KDD2000 Conference, Boston, 2000
[5] I. Cadez, D. Heckerman, C. Meek, P. Smyth, S. White, Visualization of Navigation Patterns on a Web Site Using Model Based Clustering, in Proc. of ACM KDD2000 Conference, Boston MA, 2000
[6] CTI DePaul web server data, http://maya.cs.depaul.edu/~classes/ect584/data/cti-data.zip
[7] M. Deshpande, G. Karypis, Selective Markov Models for Predicting Web-Page Accesses, in Proc. of the 1st SIAM International Conference on Data Mining, 2001
[8] M. Eirinaki, Web Mining: A Roadmap, Technical Report, DB-NET 2004, available at http://www.db-net.aueb.gr
[9] M. Eirinaki, M. Vazirgiannis, Web Mining for Web Personalization, in ACM Transactions on Internet Technology (TOIT), 3(1):1-29, 2003
[10] M. Eirinaki, M. Vazirgiannis, I. Varlamis, SEWeP: Using Site Semantics and a Taxonomy to Enhance the Web Personalization Process, in Proc. of ACM KDD2003 Conference, Washington DC, 2003
[11] T. Haveliwala, Topic-Sensitive PageRank, in Proc. of WWW2002 Conference, Hawaii, 2002
[12] Z. Huang, X. Li, H. Chen, Link Prediction Approach to Collaborative Filtering, in Proc. of ACM JCDL’05, 2005
[13] M. Kendall, J.D. Gibbons, Rank Correlation Methods, Oxford University Press, 1990
[14] S.D. Kamvar, T.H. Haveliwala, C.D. Manning, G.H. Golub, Extrapolation Methods for Accelerating PageRank Computations, in Proc. of the 12th International World Wide Web Conference, 2003
[15] S.D. Kamvar, T.H. Haveliwala, G.H. Golub, Adaptive Methods for the Computation of PageRank, in Proc. of the International Conference on the Numerical Solution of Markov Chains, 2003
[16] M. Levene, G. Loizou, Computing the Entropy of User Navigation in the Web, in Intl. Journal of Information Technology and Decision Making, 2:459-476, 2003
[17] E. Manavoglou, D. Pavlov, C.L. Giles, Probabilistic User Behaviour Models, in Proc. of ICDM 2003
[18] R. Motwani, P. Raghavan, Randomized Algorithms, Cambridge University Press, United Kingdom, 1995
[19] msnbc.com Web Log Data, available from UCI KDD Archive, http://kdd.ics.uci.edu/databases/msnbc/msnbc.html
[20] M. Nakagawa, B. Mobasher, A Hybrid Web Personalization Model Based on Site Connectivity, in Proc. of the 5th WEBKDD Workshop, Washington DC, 2003
[21] N. Polyzotis, M. Garofalakis, Structure and Value Synopses for XML Data Graphs, in Proc. of the 28th VLDB Conference, 2002
[22] N. Polyzotis, M. Garofalakis, Y. Ioannidis, Approximate XML Query Answers, in Proc. of SIGMOD 2004, Paris, France, 2004
[23] M. Richardson, P. Domingos, The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank, in Neural Information Processing Systems, 14, pp. 1441-1448, 2002
[24] R.R. Sarukkai, Link Prediction and Path Analysis Using Markov Chains, in Computer Networks, 33(1-6): 337-386, 2000
[25] R. Sen, M. Hansen, Predicting a Web user’s next access based on log data, in Journal of Computational Graphics and Statistics, 12(1):143-155, 2003
[26] J. Zhu, J. Hong, J.G. Hughes, Using Markov Models for Web Site Link Prediction, in Proc. of ACM HT’02, Maryland, 2002