Academia.eduAcademia.edu

Analysis of the query logs of a Web site search engine

2005, Journal of the American Society for Information Science and Technology

A large number of studies have investigated the transaction log of general-purpose search engines such as Excite and AltaVista, but few studies have reported on the analysis of search logs for search engines that are limited to particular Web sites, namely, Web site search engines. In this article, we report our research on analyzing the search logs of the search engine of the Utah state government Web site. Our results show that some statistics, such as the number of search terms per query, of Web users are the same for general-purpose search engines and Web site search engines, but others, such as the search topics and the terms used, are considerably different. Possible reasons for the differences include the focused domain of Web site search engines and users' different information needs. The findings are useful for Web site developers to improve the performance of their services provided on the Web and for researchers to conduct further research in this area. The analysis also can be applied in e-government research by investigating how information should be delivered to users in government Web sites.

Analysis of the Query Logs of a Web Site Search Engine Michael Chau School of Business, The University of Hong Kong, Pokfulam, Hong Kong. E-mail: [email protected] Xiao Fang College of Business Administration, The University of Toledo, Toledo, OH 43606. E-mail: [email protected] Olivia R. Liu Sheng School of Accounting and Information Systems, The University of Utah, Salt Lake City, UT 84112. E-mail: [email protected] A large number of studies have investigated the transaction log of general-purpose search engines such as Excite and AltaVista, but few studies have reported on the analysis of search logs for search engines that are limited to particular Web sites, namely, Web site search engines. In this article, we report our research on analyzing the search logs of the search engine of the Utah state government Web site. Our results show that some statistics, such as the number of search terms per query, of Web users are the same for general-purpose search engines and Web site search engines, but others, such as the search topics and the terms used, are considerably different. Possible reasons for the differences include the focused domain of Web site search engines and users’ different information needs. The findings are useful for Web site developers to improve the performance of their services provided on the Web and for researchers to conduct further research in this area. The analysis also can be applied in e-government research by investigating how information should be delivered to users in government Web sites. Introduction The amount of information on the World Wide Web is growing rapidly, and search engines have become increasingly important in helping users in information retrieval and other activities on the Web. General-purpose search engines, such as Google (http://www.google.com/), Excite (http://www.excite.com/), and AltaVista (http:// www.altavista.com/), have been widely used. Many users begin their Web activities by submitting a query to a search engine. These search engines use Internet spiders/crawlers Received January 23, 2004; revised July 26, 2004; accepted August 24, 2004 • © 2005 Wiley Periodicals, Inc. Published online 31 August 2005 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/asi.20210 that automatically collect Web pages and create an index that can be searched by users (Chau & Chen, 2003b). As these general-purpose search engines do not restrict themselves to particular domains or specialties, they often try to collect as many Web pages as possible. However, as the number of indexable pages on the Web has exceeded 3 billion, it has become more difficult for these search engines to keep an up-to-date and comprehensive search index; low precision and low recall rates often result. On the other hand, many Web sites have their own search engines. A Web site search engine is one that allows users to search for pages only within a particular Web domain or Web host. It is often found in the main page of a Web site and only indexes pages in that particular Web site. For example, going to the homepage of Microsoft Corporation (http://www.microsoft.com/), one would see a text box on the page that allows users to perform a search restricted to the microsoft.com Web site. Because a Web site search engine only needs to work with a limited set of pages, it can provide customized features to users and can be updated much more frequently (e.g., daily or even hourly) than general-purpose search engines. In addition, a Web site search engine is often more comprehensive because it can index pages that are not accessible by search engine crawlers/spiders, and no general-purpose search engines can cover every single page in every single Web site on the Internet. Although Web site search engines are very useful for users looking for information on a particular Web site, these search engines often have different user interfaces, interpret queries in different ways, support different types of advanced search functionalities, and employ different search algorithms. From end users’ point of view, dealing with an array of different interfaces and understanding each one’s idiosyncrasies add much confusion and present an additional layer of information and cognitive overload. Most users find it difficult to adapt to different search engines and cannot JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 56(13):1363–1376, 2005 fully utilize their capabilities. Therefore, it is very important to understand better the information needs and the search behavior of users in order to build better Web site search engines. Ideally, we would sit behind the users’ back, watch how they perform searches using these search engines, and record their actions and search results. Such observation is, however, not feasible for large-scale evaluation of Web site search engines because of the large number of users who are scattered around the world. Although it is possible to capture users’ actions on their computers (e.g., their clicking on a mouse button or scrolling a window on the screen) by using client-side monitoring techniques, doing so often requires the installation of specific software or “plug-ins” on the users’ computers (Montgomery & Faloutsos, 2001; Fenstermacher & Ginsburg, 2003). As these techniques require extra time and effort from the users and introduce privacy concerns, most users are not willing to install such software. Alternatively, server-side search engine log data can provide much information about users’ search behavior and information needs. Server-side data are easy to collect without any extra effort from the users, and they are much more comprehensive and scalable than client-side data as they can cover every single user who has visited the Web page or search engine of interest. Several commercial and research projects have reported on the analyses of such data for various Web applications. For example, studies on the Excite query logs have been widely published (Jansen, Spink, Bateman, & Saracevic, 1998; Jansen, Spink, & Saracevic, 2000; Ross & Wolfram, 2000; Spink, Jansen, Wolfram, & Saracevic, 2002; Spink, Wolfram, Jansen, & Saracevic, 2001). Analysis of the AltaVista query logs also has been reported (Silverstein, Henzinger, Marais, & Moricz, 1999). Such analyses can provide much information about the information needs and searching behavior of search engine users. There is, however, little research reported on the analysis of Web site search engine logs. It has been shown that the information needs of users of Web site search engines can be quite different from those of users of general-purpose search engines (Wang, Berry, & Yang, 2003). It would be interesting to compare the various metrics across different types of search engines. In this article, we report our research on the analysis of the search query logs collected from a Web site search engine. We study the information needs and search behavior of the users for the search engine and compare them with those of general-purpose search engine users. The article is structured as follows: In the second section we review related research in Web mining and search engine log analysis. We pose our research questions in the third section. The fourth section discusses the data and the methods we used in this research. In the fifth section we present and discuss the findings of our analysis. We conclude the article in the sixth section with a summary of our study and some future directions. 1364 Related Studies Analysis of Web log data or search engine log data can be categorized under the research area of Web mining. The term Web mining was first used by Etzioni (1996) to denote the use of data mining techniques to discover Web documents and services, extract information from Web resources, and uncover general patterns on the Web automatically. Over the years, Web mining research has been extended to cover the use of data mining as well as other similar techniques to discover resources, patterns, and knowledge from the Web and Web-related data (such as Web usage data or Web server logs). Web mining research can be classified into three categories: Web content mining, Web structure mining, and Web usage mining (Kosala & Blockeel, 2000; Chau, & Chen, 2003a). Web content mining is the discovery of useful information from Web contents, including text, images, audio, and video. Web content mining research includes resource discovery from the Web (e.g., Chakrabarti, Chau et al., 2003; van den Berg, & Dom, 1999; Chau & Chen, 2003a), document categorization and clustering (e.g., Chen, Fan, Chau, & Zeng, 2001; Kohonen et al., 2000; Zamir & Etzioni, 1999), and information extraction from Web pages (e.g., Hurst, 2001). Web structure mining studies the model underlying the link structures of the Web. It usually involves the analysis of in-links and out-links information of a Web page and has been used for search engine result ranking and other Web applications (Brin & Page, 1998; Kleinberg, 1998). Web usage mining focuses on using data mining techniques to analyze search logs or other activity logs to find interesting patterns. One application of Web usage mining is to learn user profiles (e.g., Armstrong, Freitag, Joachims, & Mitchell, 1995). Data such as Web traffic patterns or usage statistics also can be extracted from Web usage logs in order to improve the performance of a Web site (e.g., Cohen, Krishnamurthy, & Rexford, 1998; Fang & Sheng, 2004). The mining of search engine logs, usually focused on the study of how users use the search engines on the Web to satisfy their information needs, belongs to the category of Web usage mining. On the other hand, it also is a strongly rooted in information retrieval research. Many studies have reported analysis of user information behavior, search queries, and search sessions with various information retrieval and digital libraries systems (e.g., Fenichel, 1981; Bates, Wilde, & Siegfried, 1993). Since the Internet evolution, we have seen many studies devoted to search engines and information systems on the Web. The first category of Web search engine log research focused on analyzing the search logs submitted to general-purpose search engines. In 1998, Jansen, Spink, and several others started a series of studies on the search logs that were made available by Excite. Their first study analyzed a set of 51,473 queries submitted to the Excite search engine in 1997 (Jansen et al., 1998; Jansen et al., 2000). Subsequently, they expanded their research and analyzed three sets of data collected in 1997, 1999, and 2001, each containing at least 1 million queries submitted to the Excite search engine (Spink et al., JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY—November 2005 2001, 2002; Wolfram, Spink, Jansen, & Saracevic, 2001). Researchers were also able to obtain interesting findings on the information needs and search behavior of users, such as the trends in Web searching (Spink et al., 2002), sexual information searching on the Web (Spink, Ozmutlu, & Lorence, 2004), and question format Web queries (Spink & Ozmultu, 2002). Another large-scale Web query analysis was performed by Silverstein and associates (1999) on a set of 993 million requests submitted to the AltaVista search engine over a period of 43 days in 1998. Most of these studies used a set of similar metrics or statistics in their studies, including number of sessions, number of queries, number of queries in a session, number of terms in a query, percentage of queries using Boolean queries, and number of result pages viewed by each user. These metrics allow researchers to compare their findings across different types of search engines at different times. The second category of research focused on analyzing the search logs of a specific Web site or system. However, only a few studies have been reported in this category. One example is the study of Croft, Cook, and Wilder (1995), which investigated the search queries submitted to the THOMAS system, an online searchable database consisting of U.S. legislative information. They analyzed 94,911 queries recorded in their system and identified the top 25 queries. They also found that 88% of all queries contain three or fewer words, a number much lower than that of traditional information retrieval systems. Jones, Cunningham, and McNam (1998) analyzed the transaction logs of the New Zealand Digital Library that contained a collection of computer science technical reports. They obtained similar results regarding the number of words in queries: almost 82% of queries were composed of three or fewer words. Their study also found that most people use the default settings of search engines without any modifications. Wang, Berry, and Yang (2003) analyzed the search queries submitted to the search engine of the University of Tennessee, Knoxville, over a period of 4 years. They performed a longitudinal analysis on the search queries and found that seasonal patterns exist in Web site searching. They also identified some differences between the search queries submitted to a general-purpose search engine and a Web site search engine, in terms of search topics, search term distribution, and mean length of queries. Research Questions Most previous studies focused on the users of generalpurpose Web search engines, and their findings may not be applicable to Web site search engines that have their own characteristics and groups of users. Consequently, a Web site search engine should be designed in a customized way according to its users’ information needs, which we suggest can be identified from the search engine’s query log data. We address the following research questions in this study: (1) What are the characteristics of the search queries submitted to a Web site search engine? (2) How do these queries compare to those submitted to general-purpose search engines? (3) How can these results be used to improve the design of the Web site search engine? Data and Methods In this study, we collected the search queries submitted to the Utah state government Web site in the United States (http://www.utah.gov/), captured over a period of 168 days from March 1, 2003, to August 15, 2003. The Utah state government Web site was one of the most advanced government Web sites and was named the best state government Web portal in the United States by the Center for Digital Government (Center for Digital Government, 2003). The Web site search engine is accessible from a text box with the text “Search Utah.gov” near the top of the main page of the Web site (see Figure 1). A user can enter a search query in the box and click on the Go button to submit the query to the Web site search engine and obtain the search results. Alternatively, the user can go to the Web page (http:// info.utah.gov/) to type in the search query and specify the search options (see Figure 2). In total, there are 1,895,680 records in the Web transaction log. Each record in the log file represents a request sent from a user to the search engine. A request can be a search query (requesting either the first page of search results or subsequent pages beyond the top 25 results), a request for viewing the actual document in the search result, or a request for an image file for display. Each record contains 14 fields, including date, time, Internet Protocol (IP) address, type of request submitted, and the parameters of the request. An example of a record is given in Table 1. We are most interested in the fields Date, Time, Client IP Address, Request method, Uniform Resource Identifier (URI) stem, URI query, and User cookie. In the following we discuss the use of these fields. The first field we used for our processing is the URI Stem, in which the value “/search.asp” specifies that the transaction is search-related. Other non-search-related transactions have the value of a file name in the URI Stem field, e.g., “/Images/RankLow.gif.” Because the requests for such static documents or image files are not relevant to our study, these transactions were removed from our records. Some other irrelevant transactions were also removed from the logs. Thus, the remaining logs represent two major types of transactions: a user submitting search query or a user viewing an actual document on the Web site in the search result. The second type of query is especially interesting because most previous research on search engine log analysis (such as the studies on AltaVista and Excite) did not capture or study whether users click on any actual documents in the search result pages. Among the search queries, 443,342 queries were generated by Web spiders—programs that automatically collect pages from the Web. These spiders sent queries to the search engine to generate search results for search engine indexing (Chau & Chen, 2003b). The queries generated by search spiders can be identified by the User-Agent field in the transaction log. For example, the queries submitted by the search JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY—November 2005 1365 FIG. 1. The main page of the Utah state government Web site with access to the search engine. spiders from FirstGov.gov, a search engine for U.S. government information, have the value “FirstGov.gov⫹Search⫹⫹POC:[email protected]” in the User-Agent field in the log, but the same field for a transaction submitted by a real user would have the name and features of the Web browser used by the user, such as “Mozilla/4.0⫹ (compatible;⫹MSIE⫹6.0;⫹Windows⫹NT⫹5.0).” Because the queries generated by the search spiders do not represent the real information needs of real users, we also removed these transactions from the logs. One should note that there may still be some automatically generated queries in the log data after the filtering because search engine spiders can “fake” themselves as real users when submitting their queries to the Utah search engine. However, the number of such queries should be relatively small and hence negligible. The data were then loaded into a relational database for further processing and analysis. The next step was to identify the different sessions in the search query data.Asession is a series of queries submitted by a single user within a small range of time (Silverstein et al., 1999). A session represents a set of queries relevant to a user’s single information need. To 1366 identify session information, we first had to identify the unique usersinthetransactionlogs.Wedidsoonthebasisofthecookie information and the IP address of each user. As each browser was assigned a unique cookie by a Web site, we could identify each unique browser used by the users. Although it is possible that users may share the same computer (e.g., in a public library), the possibility did not affect our identification of sessions very much. The IP address is less reliable in identifying unique users from a Web transaction log because the same user may be assigned two different IP addresses in different sessions, and two different users may share the same IP address. The intermediate proxy servers may even assign the same IP to all users, a problem known as the AOL effect (Pierrakos, Paliouras, Papatheodorou, & Spyropoulos, 2003). Because the use of cookies is more reliable, it was chosen as the first metric in our processing. In cases in which the cookie informationwasnotavailable(e.g.,disabledbytheuser),theIP address was used. Each user identified was assigned a unique identification (ID) in our database. We then ordered the search queries for each user according to the timestamp in the transaction logs. Following previous JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY—November 2005 FIG. 2. TABLE 1. The search engine of the Utah state government Web site. Example of a record in the search log data. Field Date Time Client IP address Client-to-server username Server IP address Server port Request method URI stem URI query Server-to-client status Server-to-client bytes User agent (browser) User cookie Referrer URL Value 2003-03-01 00:02:40 66.119.xxx.xxx (hidden for privacy reasons) — 198.239.xxx.xxx (hidden for security reasons) 80 GET /search.asp postFlag⫽1&opr1⫽1&val1⫽dmv&MSS. request.Search%20Catalog...… 200 0 Mozilla/4.0⫹(compatible;⫹MSIE⫹5.5;⫹ Windows⫹NT⫹5.0;⫹T312461) — http://utah.gov/government/judicial.html Note. IP, Internet Protocol; URI, Uniform Resource Identifier; URL, Universal Resource Locator. research that suggested that queries for a single information need should be close together in terms of time (Silverstein et al., 1999), we used a cutoff of 30 minutes to identify sessions. If a user submitted a query within 30 minutes of the previous transaction, these transactions were included in the same session. On the other hand, if a user did not submit any requests to the Web server for 30 minutes, anything submitted afterward would be included in a new session. Each session was assigned a unique session ID in our database. Because the Utah state government Web site search engine employs the Active Server Pages (ASP) method in their Web transactions, each query had to be parsed from a set of parameters from the URI Query field in the transaction log before it could be further processed.Asimple program was written in Java to perform this task. All queries were stored in lowercase letters. The terms, Boolean operators, and other search features in the query were also identified for each query and stored in the database. The data consist of 1,115,388 transactions in total, in which 792,103 transactions are search queries and 323,285 are requests for actual documents in the search result. An overview of our data is presented in Table 2. JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY—November 2005 1367 Overview of the search log data. Total number of search queries Total number of requests for actual documents in the search result Total number of unique users Total number of sessions TABLE 3. 792,103 323,285 161,042 458,962 Analysis Results In this section we present the characteristics of our query data logs and the results of our analysis. We first generate descriptive statistics about the queries and search sessions of users and then compare them with the results from other studies. We also analyze how users utilize the advanced search features provided by the Web site search engine. Finally, we perform text analysis on the queries submitted by the users to identify the most common query words used and the associations between search terms. Sessions, Queries, and Topics Sessions and queries. As discussed earlier, there are 792,103 queries in total, submitted by a total of 161,042 unique users in 458,962 sessions. We classified the 792,103 queries into three groups, namely, unique queries, repeat queries, and empty queries (Spink et al., 2001). Unique queries are all differing queries entered by one user in one session. The differing queries can be new queries or modifications of the previous query. Repeat queries are the repeat occurrences of any query that appears previously in a session. Such repeat occurrences of a query also result from the viewing of subsequent result pages by the users. Empty queries are queries that contain no query terms. In our study, we found that a large number of users clicked on the Go button on the home page of the Utah state government Web site (see Figure 1) without changing the text “Search Utah.gov” in the search box. The result was a search query of “Search Utah.gov” submitted to the search engine. Because the users did not enter any search term in such cases, we also consider these as empty queries. Out of the 792,103 queries, 575,389 (72.6%) are unique queries, 98,418 (12.4%) are repeat queries, and 118,296 (14.9%) are empty queries. The mean number of queries in each session is 1.73 (with a median of 1), and the mean number of unique queries in each session is 1.25 (with a median of 1). This number is much lower than the number of 2.52 reported in the Excite study (Spink et al., 2001) and 2.02 reported in the AltaVista study (Silverstein et al., 1999). The results are summarized in Table 3. To study the number of queries per session in more detail, we look at the distribution of the numbers. In out study, 73.0% of sessions contain only one search query; 12.4% of sessions contain only two (see Figure 3). As can be seen, the distribution is skewed toward the lower end in terms of the number of queries submitted. About 96.1% of users submitted four or fewer queries in a session. This finding is 1368 Statistics on queries. Total number of search queries Total number of unique queries Total number of repeat queries Total number of empty queries Mean number of queries per session Median number of queries per session Mean number of unique queries per session Median number of unique queries per session 792,103 575,389 98,418 118,296 1.73 1 1.25 1 consistent with those of other studies, in which most sessions are very short and contain only one or two queries (Spink et al., 2001; Silverstein et al., 1999). There are two possible reasons for the lower number of queries submitted to our search engine per session when compared with results reported in previous studies. First, when compared with the Excite study, our study has a different definition of sessions involving time-out and cookie information, which would result in a larger number of sessions. Second, it is possible that many users were able to find the results they wanted in the first query and therefore did not need to modify their queries to perform a new search. This reult may indicate a special characteristic of Web site search engines because they need to index only a limited number of pages; higher precision and recall in the search results may be achieved more easily. At the same time, users may have a better idea of what they want specifically when using a Web site search engine, so they may be able to formulate better queries in the first step. Therefore, there is a smaller need for users to modify their search queries. It is also possible that people use generalpurpose search engines and Web site search engines differently for other reasons. As previous Web site search engine studies do not include session information (Croft et al., 1995; Jones et al., 1998; Wang et al., 2003), further research will be needed. Number of Queries Submitted Per Session 0.8 0.7 0.6 % of Sessions TABLE 2. 0.5 0.4 0.3 0.2 0.1 0 1 10 20 30 40 50 60 70 80 90 Number of Queries Submitted FIG. 3. Number of queries submitted per session. JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY—November 2005 100 TABLE 4. Comparison of top 25 queries with AltaVista and Knoxville. This study Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 AltaVista study Knoxville study Query Frequency Query Frequency Query Frequency dmv tax forms sex offenders forms jobs divorce unemployment employment notary secretary of state sales tax map sex offender dopl ors maps real estate taxes tax birth certificates birth certificate drivers license medicaid child support “dmv” 3,794 2,532 2,173 2,036 1,587 1,400 1,359 1,257 1,061 1,053 941 921 890 884 879 849 798 763 748 741 729 717 714 699 656 sex applet porno mp3 chat warez yahoo playboy xxx hotmail [non-ASCII] pamela anderson p**** [vulgarity] sexo porn nude lolita games spice girls beastiality animal sex SEX gay titanic Bestiality 1,551,477 1,169,031 712,790 613,902 406,014 398,953 377,025 356,556 324,923 321,267 263,760 256,559 234,037 226,705 212,161 190,641 179,629 166,781 162,272 152,143 150,786 150,699 142,761 140,963 136,578 career services grades tuition housing timetable bookstore Rocky Top transcripts Daily Beacon employment cheerleading band registration scholarships jobs football tickets career marching band cheerleaders resume financial aid webmail tickets transcript catalog 9,587 5,727 4,837 4,203 4,097 3,453 2,582 2,340 2,312 2,156 1,985 1,914 1,683 1,537 1,488 1,465 1,407 1,397 1,377 1,375 1,331 1,317 1,225 1,211 1,187 Search Queries and Search Topics. To investigate the search topics of the users, we identified the top 25 queries submitted to the Utah search engine and compared them with that of the AltaVista study (Silverstein et al., 1999) and the Knoxville study (Wang et al., 2003), which analyzed the user queries submitted to the search engine of the University of Tennessee, Knoxville. The data are shown in Table 4. The top three topics identified in our study are “dmv,” “tax forms,” and “sex offenders.” From the table, it can be seen that the top queries are quite different across the three search engines. The top AltaVista queries are mostly related to sexual information, software, music, and entertainment, and the queries submitted to the Knoxville search engine are mostly related to academic matters. The top queries in our study, however, are government-related. In other words, search queries submitted to the Web site search engines are generally relevant to the corresponding domain. The results suggested that general-purpose search engines and Web site search engines have to be designed differently according to users’ different information needs and query characteristics. Seasonal effect of search topics. In their longitudinal analysis of a Web query log that covered a period of 4 years, Wang and coworkers (2003) showed that a seasonal effect exists in an academic Web site search engine. They found that the query “career services” occurred mostly in February, March, September, and October, and the query “football tickets” appeared mostly in August and September. This interesting finding had not been identified in previous generalpurpose search engine research, in which the data were often limited to a short period (e.g., 43 days in the AltaVista study and 1 day in the Excite study). The data in our study covered a period of about 5.5 months (168 days). Though the data did not cover a whole calendar year, it is also interesting to see whether any particular seasonal patterns exist in our data. To this end, we took the top three queries, namely, “dmv,” “tax forms,” and “sex offenders,” and analyzed their daily search frequencies. The results are plotted in Figure 4. No apparent seasonal patterns were found for the queries “dmv” and “sex offenders” (see Figures 4a and 4c). However, it is interesting to note that there are slightly more requests for “sex offenders” in March. This increase is probably not related to seasonal effect; as we suggested, it may be caused by the fact that the U.S. Supreme Court ruled on March 5, 2003, that it is legal for state governments to put pictures of convicted sex offenders on the Internet (CBS News, 2003). This news was widely covered in the media, and that coverage may have drawn the public to look for the sex offender listing in the Utah state site and for other relevant information on the Web site. On the other hand, it can be easily seen that the “tax forms” query demonstrated a very strong seasonal effect in our data (see Figure 4b). The number of search queries for “tax forms” had been steady since the beginning of March (or possibly earlier but we do not have the data to verify). The JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY—November 2005 1369 (a) 1000 100 900 90 800 80 700 70 600 60 500 50 400 40 300 30 200 20 100 0 3/1/2003 10 0 3/1/2003 (b) 4/1/2003 5/1/2003 6/1/2003 7/1/2003 8/1/2003 FIG. 5. 4/1/2003 5/1/2003 6/1/2003 7/1/2003 8/1/2003 Analysis of daily frequencies of the tax-related queries. 400 350 300 250 200 150 100 50 0 3/1/2003 (c) 4/1/2003 5/1/2003 6/1/2003 7/1/2003 8/1/2003 4/1/2003 5/1/2003 6/1/2003 7/1/2003 8/1/2003 100 90 80 70 60 50 40 30 the search topics are. The basic statistics are shown in Table 5. In total, there are 1,518,984 terms in the queries. Out of these terms, 67,958 are unique terms. The longest query consists of 40 terms. The mean number of terms in a query is 2.25, with a median of 2. This finding is consistent with the result in the AltaVista study and the Excite study, which found the mean number of terms in a query to be 2.35 and 2.21, respectively. The finding reiterates that Web queries are much shorter than those of traditional information retrieval systems. The distribution of the number of terms per query is shown in Figure 6. Some 30.7% of queries contain only a single term, 37.0% contain two, and 19.2% contain three. About 97.6% of queries contain five or fewer terms. The results show that Web users are likely to use short queries, whether they are using general-purpose search engines or Web site search engines. 20 10 0 3/1/2003 FIG. 4. Analysis of daily frequencies of the top queries: (a) “dmv,” (b) “tax forms,” (c) “sex offenders.” number of requests peaked on April 15, the deadline for filing individual tax returns in the United States, with 364 requests on a single day. The number dropped dramatically after the deadline had passed, with an average of fewer than 10 requests per day. This finding corroborates the seasonal effect identified by Wang and associates (2003) discussed earlier. To analyze further the seasonal effect of tax-related queries, we analyzed the daily requests for a broader set of tax-related queries, including all queries that contain the terms “tax,” “irs,” or “internal revenue.” The result is shown in Figure 5. It can be seen that the pattern is similar to that in Figure 4b, in which the number reached the peak on April 15 and decreased quickly afterward. Search Terms Search terms in queries. Similarly to other Web search studies, we analyze the search terms in each query submitted to the Web site search engine. Analysis of the search terms can reveal how users formulate their search queries and what 1370 Search term distribution. Several previous studies on Web search analysis have suggested that the distribution of terms used in Web search engines largely follows the Zipf distribution (Jansen et al., 2000; Spink et al., 2001). In a Zipf distribution, the quantity of interest is inversely proportional to its rank, and the Zipf distribution represents the distribution of terms in long English texts. To see whether the same pattern is observed in our data, a double-log rank-frequency chart, as shown in Figure 7, was used. The plot should be close to a straight line for a Zipf distribution. It can be seen that the plot follows the Zipf distribution, with some discrepancies for the high- and low-ranking TABLE 5. Statistics on terms. Total number of terms Total number of unique terms Mean number of terms per query Median number of terms per query Largest number of terms per query Percentage of nonempty queries with one term Percentage of nonempty queries with two terms Percentage of nonempty queries with three terms Percentage of nonempty queries with four terms Percentage of nonempty queries with five terms Percentage of nonempty queries with six or more terms JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY—November 2005 1,518,984 67,958 2.25 2 40 30.7% 37.0% 19.2% 7.6% 3.2% 2.4% TABLE 6. Number of Terms Per Query Comparison of top 50 query terms with Excite and Knoxville. This study 0.4 0.35 Rank Term Frequency Excite study Knoxville study Term Frequency Term and of sex free the nude pictures in university pics chat for adult women new xxx girls music porn to gay school home college state naked american stories software games diana p**** black on photos jobs world a magazine nudes news football page computer princess airlines download real education art 21,385 12,731 10,757 9,710 8,013 7,047 5,939 5,196 4,383 3,815 3,515 3,431 3,385 3,211 3,109 3,010 2,732 2,490 2,400 2,265 2,187 2,176 2,150 1,043 2,010 1,968 1,961 1,958 1,908 1,904 1,885 1,876 1,823 1,813 1,799 1,735 1,734 1,711 1,690 1,690 1,687 1,627 1,591 1,533 1,461 1,409 1,381 1,381 1,376 1,374 of services career student and grades school tuition housing football timetable schedule center office band for department UT Tennessee graduate % of Queries 0.3 0.25 0.2 0.15 0.1 0.05 0 1 10 20 30 40 Number of Terms FIG. 6. Number of terms per query. Rank-frequency Distribution of Terms log(term frequency) 5 4 3 2 1 0 0 1 2 3 4 5 log(term rank) FIG. 7. Double-log rank-frequency graph for search terms. terms. The slope of our plot is ⫺0.9533, which is close to the theoretical value of ⫺1 for a Zipf distribution. When compared with the distribution reported in the Excite study (Jansen et al., 2000; Spink et al., 2001), the distribution in the present study corresponds better to the Zipf distribution, especially for the lower end of the curve (terms with low frequency). Although a more sophisticated model may be needed, the preliminary finding suggests that there is a smaller percentage of infrequently used terms in Web site search engines than in general-purpose search engines. One possible explanation is that although the users of both types of search engines use a diverse set of terms, Web site search engines are restricted to particular domains and the proportion of unique or low-frequency terms is therefore also restricted. Search terms and search topics. To study the search topics of the users further, we identified the top 50 terms in the queries in our study and compared with those of the Excite study (Spink et al., 2001) and the top 20 terms reported in the Knoxville study (Wang et al., 2003). The results are shown in Table 6. The general pattern is similar to that of the query analysis presented in Table 4. First, the terms used in Web 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 utah of and tax state license county forms department lake business for search sex form in city salt services laws registration child dmv code insurance sales health public the vehicle division jobs records water income offenders a map property application unemployment office motor commission notary court to employment divorce marriage 40,425 40,072 29,208 28,457 25,681 15,099 12,115 12,049 9,647 9,398 8,866 8,816 8,612 7,626 7,560 7,298 6,727 6,649 6,383 6,188 6,016 5,929 5,825 5,661 5,545 5,545 5,439 5,289 5,049 5,036 4,974 4,840 4,752 4,730 4,706 4,659 4,648 4,373 4,348 4,238 4,213 4,020 4,012 3,931 3,929 3,873 3,872 3,835 3,747 3,678 site search engines are very different from those used in general-purpose search engines. Although some functional words are common across all three studies (such as “and,” “of,” and “for”), the semantic words are very different. The top Excite terms are mostly related to sexual information; the query terms submitted to the Web site search engines are mostly relevant to the corresponding domain (e.g., “utah,” “tax,” and “state” for the Utah state government Web site and JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY—November 2005 1371 “career,” “student,” and “grades” in the academic domain). Other popular search terms identified in our study include “license,” “country,” “forms,” “department,” and “laws” (see Table 6). Again, the results suggested that users have different information needs when using general-purpose search engines and Web site search engines. It is also interesting to note that the term “sex” also ranks 14th in our study. After looking into the queries containing the term, we found that most of these queries were submitted to search for information concerning the list of sex offenders in the state of Utah. By contrast, in the Excite data a large percentage of queries are related to sex and pornography (Spink et al., 2002; Spink et al., 2004). When analyzing the functional words, we found that both “and” and “of” appear in the top five in all three studies; “and” is frequently used because it can be used as a Boolean operator as well as a term on its own. On the other hand, it is interesting to note that although the word “of” is ignored by many search engines unless clearly specified by the user (e.g., Google), it is still frequently used by Web searchers. As pointed out by Wang and colleagues (2003), generalpurpose search engines such as Excite appear to have more functional words in the top terms, and Web site search engines have more semantic terms. As can be seen from Table 6, Excite has 6 functional words in the top 20 (“and,” “of,” “the,” “in,” “for,” “to”), but the Knoxville data has only 3 (“of,” “and,” “for”), and this study only has 4 (“of,” “and,” “for,” “in”). One possible reason is that the search queries in general-purpose search engines are more diverse, such that fewer semantic words appear frequently enough to appear in the top 20 list. However, for Web site search engines, the search queries in limited domains include more semantic words on the same topics that are more frequently used and thus have a higher rank. To analyze further the topics submitted to the Utah state government Web site search engine, we also studied which terms were used more frequently together. Terms used together are often more informative in identifying which topics are frequently searched by users. The top 50 term pairs are shown in Table 7. Many frequently searched topics can be identified from the data, e.g., “tax forms,” “sales tax,” “state tax,” “sex offenders,” “business license,” and “birth certificate.” As can be seen from Table 7, many of term pairs appear to be part of a group of three or more terms that appear frequently together. For example, the pairs “state-utah” and “of-utah” are likely to be part of the phrase “state of utah,” and the pairs “department-of,” “division-of,” “motorvehicle,” “motor-vehicles,” and “of-vehicles” are possibly part of the phrase “department of motor vehicles” and its variations. In order to identify these topics, we analyzed our data again and identified the groups of three and four terms that appear most frequently in the queries. The most frequent term groups with three terms are listed in Table 8 and those with four terms are listed in Table 9. In Table 8, search topics such as “Salt Lake City,” “Utah State Tax,” “State of Utah,” and “Secretary of State” can be easily identified. The table also reveals that some term 1372 TABLE 7. Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 Top 50 term pairs. Term pair department state of lake tax of sales state offenders division in income estate tax code city city offender of drivers and commission of child of county county form county notary commerce of secretary motor and motor return motor recovery articles business park parks birth security application department board of child Frequency of utah utah salt forms state tax tax sex of utah tax real utah utah lake salt sex services license of tax office support the lake utah tax salt public of secretary state vehicles utah vehicle tax of services of license state state certificate social for utah of vehicles care 7,622 6,997 6,988 6,335 5,469 4,471 4,372 4,333 4,145 3,734 3,658 3,596 3,463 2,908 2,667 2,587 2,549 2,401 2,313 2,194 2,156 2,057 2,006 1,921 1,812 1,791 1,772 1,772 1,762 1,683 1,680 1,647 1,641 1,640 1,633 1,630 1,501 1,495 1,462 1,368 1,326 1,302 1,299 1,293 1,284 1,274 1,234 1,227 1,225 1,203 groups still appear to be part of an even longer phrase, e.g., “motor-of-vehicles,” “department-of-motors,” etc. In Table 9, we identified the top three four-term groups that co-occurred most frequently in the queries. One can easily see that these represent three search topics frequently requested by users, namely, “Office of Recovery Services,” “Department of Motor Vehicles,” and “Utah State Tax Commission.” The results show that such information is frequently requested by users and suggests that Web site designers should allow users to access it more easily JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY—November 2005 Top term groups with three terms. Rank 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 Term groups salt salt utah state secretary motor department of bill state department department of of office department department department utah state income state utah TABLE 9. lake lake state of of of of office of tax of of office recovery recovery of motor of tax tax tax income state Frequency city county tax utah state vehicles utah recovery sale commission motor commerce services services services services vehicles vehicles commission forms forms tax commission 2,524 1,749 1,613 1,611 1,610 1,168 1,082 1,039 1,005 1,000 989 980 944 922 917 870 822 812 811 774 769 755 734 Top term groups with four terms. Rank 1 2 3 TABLE 10. Result pages and actual documents viewed. Term group office department utah of of state Frequency recovery motor tax services vehicles commission 894 805 702 (e.g., through links from the main page of the Utah state government Web site). Result Pages and Actual Documents Viewed Search result pages. During a search session with the Utah state government Web site, a user first submits the query to the search engine and the first result page listing the top 25 search results will be shown. If there are more than 25 hits for the search, the user can request subsequent result pages (sometimes known as result screens). As mentioned earlier, these are often counted as repeat queries in Web search studies. On average, each user views 1.47 pages in the search results in our study (see Table 10). This result is comparable to the one reported in the AltaVista study, in which the average number of result pages viewed in a session was 1.39. The Excite study also found that 28.6% of users examined only the first page of the search results. Although the Utah search engine provides 25 search results in the first page, we do not see a big difference in the number of result pages viewed per user. In general, all results suggested that most Web users browse only a small number of result pages. Number of result pages (25 items each) viewed per session Number of actual documents in the result list viewed per session Percentage of sessions viewing zero document in the search results Percentage of sessions viewing one document in the search results Percentage of sessions viewing two documents in the search results Percentage of sessions viewing three or more documents in the search results 1.47 0.70 59.4% 28.3% 6.9% 5.4% Actual documents viewed. When browsing through the result pages, if the user sees an item in the search result of interest, he or she can click on the link and view the actual content of the searched document. The Utah state government Web site site search engine has been designed in such a way that when such an action is performed, it is logged in the transaction. In total, there are 323,285 records of such requests. On average, only 0.70 document is viewed in each session, with a median of 0. Only 0.56 document is viewed for each unique query, and 0.48 was viewed for each result page viewed by the user. In 59.4% of sessions, no single document was viewed by the user; in 28.3% of sessions, only one document was viewed. The distribution is shown in Figure 8. As can be seen, once again the distribution is highly skewed toward the lower end, suggesting that most users viewed only a very small number of result pages during a search session. The result is quite surprising, even though it is consistent with the findings of Jones and colleagues (1998), which suggested that users did not view the actual document content in 64% of the queries submitted to their Web-based digital library. We are not able to compare our findings with those of other large-scale Web search projects such as the AltaVista and the Excite studies because such data were not collected or analyzed in those studies. Were most users able to find the document they wanted on the basis of the title and snippet displayed in the result pages? Or were the results so bad that the users did not bother to click on them and look at the content? Did they give up and leave the site, or did they try to Number of Documents Viewed Per Session 0.7 0.6 0.5 % of Sessions TABLE 8. 0.4 0.3 0.2 0.1 0 0 10 20 30 40 50 Number of Documents Viewed FIG. 8. Number of documents viewed per session. JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY—November 2005 1373 locate the relevant document themselves by browsing? These questions indicate an interesting research topic for future study. Advanced Search Features and Default Settings As suggested by previous Web search studies, it is interesting to study how advanced search features such as Boolean operators are used when users formulate Web queries. The Utah state government Web site search engine provides two different types of advanced search features. In the first type, the user can choose among four options when performing a search: “Exact phrase,” “Free-text query,” “All of these words,” and “Advanced query.” The “Exact phrase” search is the default option, which adds a pair of quotation marks to the search query. The “Free-text query” allows users to search for documents containing any terms in the query submitted (a Boolean “OR” search). If the user chooses “All of these words,” the engine searches for documents containing all of the terms in the query submitted, but not necessarily appearing as a phrase (a Boolean “AND” search). The “Advanced query” allows users freely to specify AND, OR, NOT, quotation marks, and other operators in their search queries. To use the second type of advanced search feature, the users need to click on the “Advanced Search” link on the main search page. A new search page is displayed with a more complex search form. Using this form, the users can perform searches on different fields of documents (e.g., title, body, URL, author). The users can also specify a range of dates when the document was last updated, choose the number of results to be shown in one result page, and select how the results should be ordered. Our findings are summarized in Table 11, with data from the Excite study (Spink et al., 2001) listed for comparison. It can be seen that the Boolean operators AND, OR, and NOT were not used much; AND was used most frequently, appearing in 2.6% of queries. This finding is consistent with the results of the general-purpose Web search engine Excite (Jansen et al., 2000; Spink et al., 2001). About 3.4% of queries utilized the advanced search features provided in the complex search form. It is surprising to TABLE 11. Usage of operators and advanced search features. This study Feature AND OR NOT ⫹ (plus) ⫺ (minus) “” () Advanced Search Any of the above 1374 Number Percentage of queries of queries 20,206 657 692 154 364 230,094 811 26,744 272,474 Excite study Number of queries Percentage of queries 2.6% 29,146 3% 0.08% 1,149 1% 0.09% 307 0.0003% 0.0002% 44,320 5% 0.0005% 21,951 2% 29.0% 52,354 5% 0.1% (Not reported) (Not reported) 3.4% (Not reported) (Not reported) 34.4% (Not reported) (Not reported) find that this percentage is larger than that of use of Boolean operators. It suggests that a notable percentage of users prefer to utilize advanced search functions, such as searching in different fields, in performing Web searches to satisfy their information needs. As other studies did not report data about this aspect, it would be interesting to study this issue further. It is also interesting to note that although three of the operators listed in Table 11 (plus, minus, and parentheses) are not supported by the search engine, they are still used by some users. The most likely reason is that these operators are supported by several popular search engines such as AltaVista, Excite, and Google (though with different interpretations). It is possible that these users just assumed the Utah search engine would support these operators and thus utilized them in their search queries. A similar finding was also reported in the Excite study, in which some 6% of users used colons and periods, which are not supported by the Excite search engine. A surprising result in the analysis is the use of quotation marks in the search queries. The data showed that 29.0% of queries used quotation marks. This result is very different from the results reported in the Excite study, in which only 5.1% of queries made use of quotation marks. We found that the most possible cause of the large discrepancy is the default settings of the Utah search engine. When a user performs a search without making any changes to the search options, the “Exact phrase” search option is used as default and a pair of quotation marks are added to the queries. As many users do not change the default settings in information retrieval systems (Jones et al., 1998), the quotation marks are used exceptionally frequently. In total, 34.4% of queries utilized at least one of the search features. The number is much higher than the 20.4% reported in the AltaVista study, again because of the high usage of quotation marks in the search queries. Discussions Key Findings In general, we found that Web users behave similarly when using a Web site search engine and a general-purpose search engines in terms of the average number of terms per query and the average number of result pages viewed per sessions. However, the users of the Web site search engine show a lower number of queries per session and a different set of terms and topics used in their queries. We suggested the possible reason for these two differences is that users of Web site search engines have more specific information needs than those of general-purpose search engines. We also found that the search terms in a Web site search engine follow the Zipf distribution more closely than those in a general-purpose search engines, indicating that there is a smaller number of “rare terms” that are only used once or twice in our query log. In studying users’ behavior and usage patterns, we found that on average less than one document was actually viewed among the search results presented to the users. Although the JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY—November 2005 result also has been reported in another Web site search engine study (Jones et al., 1998), it has not been tested in large-scale search log study on general-purpose search engines, and the numbers (in both the study of Jones and coworkers [1998] and the present study) are much less than those reported in a monitored Web search study (Spink, 2002). Do people search differently in a real-world setting compared with an experimental environment? Further research is needed to explain the discrepancy found in these studies. Implications for Web Site Designers The similarities and differences between general-purpose search engines and Web site search engines can have important implications for the design of these search engines. Specifically, it is important to customize and improve Web site search engines on the basis of transaction log analysis. For example, as the transaction log in this study has revealed that many users rely on the default settings of the search engine without modification, it is important for search engine developers to ensure that the default options, such as Boolean operators, phrase search option, and the number of search results per page, are the most suitable setting for most users. It is also important to note that many users type their search queries in the query box without carefully looking at the original content in the box. As discussed earlier, we found that many search queries contain the phrase “Search Utah.gov,” which is the default text in the search box that tells users that searches can be performed. Web site designers should implement the search box carefully (e.g., by using client-side script) in order to prevent users from sending this type of query to the search engine becasue their doing so would result in poor search results and higher Web server load. We also showed that a large proportion of people are looking for information related to a small number of topics in Web site search engines, e.g., tax and Department of Motor Vehicles. Web site designers can learn more about users’ most-wanted information resources by analyzing the search logs or the Web access logs of a Web site. Web site designers should make the links to these resources easily accessible by users, e.g., by placing them prominently in the first page of the Web site. Conclusion and Future Directions In this article, we report our research on analyzing and mining the transaction log of a Web site search engine. The log data of the Web site search engine of the Utah state government was used as our test data. In our study of search terms and search topics, one limitation of analyzing term association on the basis of the word level is that we could not determine exactly how the words were used in the queries as phrases. Also, it is desirable to know the association between phrases rather than between terms (e.g., the association between the phrase “Department of Motor Vehicles” and other noun phrases). Future research can address this need by applying noun phrase extraction techniques, such as the Arizona Noun Phraser (Tolle & Chen, 2000), to search log data. As discussed earlier, our study also found that on average users viewed less than one document among the search results presented to them. This number is much smaller than the one reported in Spink (2002), in which the users performed searches in a monitored environment. Important potential research topics are why users view such a small number of documents in their Web searches and how their search behaviors differ in a real-life search task versus an experimental setup. It is also interesting to study how these statistics differ across the Web site search engines in different domains. As many countries have launched projects on e-government (or digital government), more and more government agencies are putting their information on the Web. The findings reported in this study also have important implications for e-government research by showing how governments can provide the general public with better access to their Web-based information by designing better Web site search engines. The analysis of the search logs helps us better understand what users are looking for on government Web sites. Although most users search for general information such as tax forms or governmental departments, further analysis of government search logs would be needed to investigate how users search for sensitive security information on government Web sites. Acknowledgments We gratefully thank the Utah state government and its provider, Utah Interactive, for kindly providing us the data and information used in this study. We also thank the two anonymous reviewers for their insightful suggestions. References Armstrong, R., Freitag, D., Joachims, T., & Mitchell, T. (1995). WebWatcher: A learning apprentice for the World Wide Web. In Proceedings of the AAAI Spring Symposium on Information Gathering from Heterogeneous, Distributed Environments. Menlo Park, CA: AAAI Press. Bates, M.J., Wilde, D.N., & Siegfried, S. (1993). An analysis of search terminology used by humanities scholars: The Getty Online Searching Project Report. Library Quarterly, 63(1), 1–39. Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the Seventh WWW Conference. Retrieved August 24, 2000, from http://www7.scu.edu.au CBS News. (2003, March 5). Court allows states to throw the book. CBS News, March 5, 2003. Retrieved August 30, 2004, from http://www.cb snews.com/stories/2003/03/05/supremecourt/main542863.shtml Center for Digital Government. (2003). Utah State portal ranks no. 1. Retrieved August 30, 2004, from http://www.centerdigitalgov.com/ center/highlightstory.phtml?docid⫽69811 Chakrabarti, S., van den Berg, M., & Dom, B. (1999). Focused crawling: A new approach to topic-specific Web resource discovery. In Proceedings of the Eighth International World Wide Web Conference. Retrieved August 30, 2004, from http://www8.org/w8-papers/5a-search-query/crawling/ index.html Chau, M., & Chen, H. (2003a). Comparison of three vertical search spiders. IEEE Computer, 36(5), 56–62. JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY—November 2005 1375 Chau, M., & Chen, H. (2003b). Personalized and focused Web spiders. In N. Zhong, J. Liu, & Y. Yao (Eds.), Web Intelligence (pp. 197–217). Heidelberg, Germany: Springer-Verlag. Chau, M., Zeng, D., Chen, H., Huang, M., & Hendriawan, D. (2003). Design and evaluation of a multi-agent collaborative Web mining system. Decision Support Systems [Special Issue on Web Retrieval and Mining], 35(1), 167–183. Chen, H., Fan, H., Chau, M., & Zeng, D. (2001). MetaSpider: Metasearching and categorization on the Web. Journal of the American Society for Information Science and Technology, 52(13), 1134–1147. Cohen, E., Krishnamurthy, B., & Rexford, J. (1998). Improving end-to-end performance of the Web using server volumes and proxy filters. In Proceedings of the ACM SIGCOMM Conference on Applications, Technologies, Architectures and Protocols for Computer Communications (241–253). Croft, W.B., Cook, R., & Wilder, D. (1995). Providing government information on the Internet: Experiences with THOMAS. In Proceedings of the Digital Libraries ’95 Conference (pp. 19–24). Retrieved August 30, 2004, from http://csdl.tomu.edu/DL95/contents.html Etzioni, O. (1996). The World Wide Web: Quagmire or gold mine. Communications of the ACM, 39(11), 65–68. Fang, X., & Sheng, O.R.L. (2004). LinkSelector: A Web mining approach to hyperlink selection for Web portals. ACM Transactions on Internet Technology, 4(2), 209–237. Fenichel, C.H. (1981). Online searching: Measures that discriminate among users with different types of experience. Journal of the American Society for Information Science, 32(1), 23–32. Fenstermacher, K.D., & Ginsburg, M. (2003). Client-side monitoring for Web mining. Journal of the American Society for Information Science and Technology, 54(7), 625–637. Hurst, M. (2001). Layout and language: Challenges for table understanding on the Web. In Proceedings of the First International Workshop on Web Document Analysis (pp. 27–30). Retrieved August 30, 2004, from http://csc.liv.9c.uk/~wda2001/ Jansen, B.J., Spink, A., Bateman, J., & Saracevic, T. (1998). Real life information retrieval: A study of user queries on the Web. ACM SIGIR Forum, 32(1), 5–17. Jansen, B.J., Spink, A., & Saracevic, T. (2000). Real life, real users, and real needs: A study and analysis of user queries on the Web. Information Processing and Management, 36, 207–227. Jones, S., Cunningham, S.J., & McNam, R. (1998). Usage analysis of a digital library. In Proceedings of the Third ACM Conference on Digital Libraries (pp. 293–294). New York: ACM Press. Kleinberg, J. (1998). Authoritative sources in a hyperlinked environment. In Proceedings of the Ninth ACM-SIAM Symposium on Discrete Algo- 1376 rithms (pp. 668–677). Philadelphia: Society for Industrial and Applied Mathematics. Kohonen, T., Kaski, S., Lagus, K., Salojärvi, J., Honkela, J., Paatero, V., et al. (2000). Self organization of a massive document collection. IEEE Transactions on Neural Networks [Special Issue on Neural Networks for Data Mining and Knowledge Discovery], 11(3), 574–585. Kosala, R., & Blockeel, H. (2000). Web mining research: A survey. ACM SIGKDD Explorations, 2(1), 1–15. Montgomery, A.L., & Faloutsos, C. (2001). Identifying Web browsing trends and patterns. IEEE Computer, 34(7), 94–95. Pierrakos, D., Paliouras, G., Papatheodorou, C., & Spyropoulos, C.D. (2003). Web usage mining as a tool for personalization: A survey. User Modeling and User-Adapted Interaction, 13, 311–372. Ross, N.C.M., & Wolfram, D. (2000). End user searching on the Internet: An analysis of term pair topics submitted to the Excite search engine. Journal of the American Society for Information Science, 51(10), 949–958. Silverstein, C., Henzinger, M., Marais, H., & Moricz, M. (1999)Analysis of a very large Web search engine query log. ACM SIGIR Forum, 33(1), 6–12. Spink, A. (2002). A user-centered approach to evaluating human interaction with Web search engines: An exploratory study. Information Processing and Management, 38, 401–426. Spink, A., Jansen, B.J., Wolfram, D., & Saracevic, T. (2002). From e-sex to e-commerce: Web search changes. IEEE Computer, 35(3), 107–109. Spink, A., & Ozmultu, H.C. (2002). Characteristics of question format Web queries: An exploratory study. Information Processing and Management, 38, 453–471. Spink, A., Ozmutlu, H.C., & Lorence, D.P. (2004). Web searching for sexual information: An exploratory study. Information Processing and Management, 40, 113–123. Spink, A., Wolfram, D., Jansen, B.J., & Saracevic, T. (2001). Searching the Web: The public and their queries. Journal of the American Society for Information Science and Technology, 52(3), 226–234. Tolle, K.M., & Chen, H. (2000). Comparing noun phrasing techniques for use with medical digital library tools. Journal of the American Society for Information Science, 51(4), 352–370. Wang, P., Berry, M.W., & Yang, Y. (2003). Mining longitudinal Web queries: Trends and patterns. Journal of the American Society for Information Science and Technology, 54(8), 743–758. Wolfram, D., Spink, A., Jansen, B.J., & Saracevic, T. (2001). Vox populi: The public searching of the Web. Journal of the American Society for Information Science and Technology, 52(12), 1073–1074. Zamir, O., & Etzioni, O. (1999). Grouper: A dynamic clustering interface to Web search results. In Proceedings of the Eighth World Wide Web Conference. Retrieved August 30, 2004, from http://www8.org/w8papers/3a-search-query/dynamic/dynamic.html JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY—November 2005