
SortTables

1995, Proceedings of the fourth international conference on Information and knowledge management - CIKM '95

Much research in information retrieval has focused more on matching results to queries than on browsing those results. After briefly exploring browsing in physical and electronic libraries, we introduce SortTables, a new system that focuses on support for browsing. We explore the evolution of the system in light of early implementation experience and formative evaluation of the interface. Finally, we briefly review related work, and discuss future directions.

SortTables: A Browser for a Digital Library

William C. Wake* and Edward A. Fox*#
Department of Computer Science* and Computing Center#
Virginia Polytechnic Institute and State University
Blacksburg, VA 24061-0106
fwakew, [email protected]

To appear in CIKM '95, The 4th International Conference on Information and Knowledge Management, Nov. 28-Dec. 2, 1995, Baltimore, Maryland. Draft: August 14, 1995.

1 Browsing in the Physical Library

In a physical library, we think of browsing as moving among the shelves, looking at items of possible interest. Several characteristics of this type of browsing stand out: access to the actual items, potential access to all the items, a sense of neighborhood, variable focus, and the opportunity for serendipity.

Physical browsing provides access to the actual items. For example, books provide powerful cues about their meaning in a situated way, through physical cues as well as content. A worn-out book might be regarded as more (or less) interesting because of its apparent popularity. A thick book might look like too much work; one with an interesting cover might be chosen instead.

Browsing potentially provides access to all the items. One has the sense that one could start at one end and go to the other, looking at each book. This is different from trying to retrieve books by generating queries: we don't have to know all the "query terms" in advance, and we don't have to deal with the same item twice (because it was retrieved by different queries).

The sense of neighborhood is an important reason that browsing is effective. Since books are organized by call number, and call numbers are assigned in a way that keeps related books together, we often find books on related topics close together.

Browsing allows a variable focus: in a promising area, one can systematically examine all items; when items in that area aren't useful, one can make a sweeping movement through large areas, skipping the uninteresting ones.

Finally, browsing allows for a natural serendipity: one can choose a book at random from a particular area, or from a random area of the library. As with variable focus, one has control over how random one's selection can be.

2 Browsing in the Electronic Library

Browsing in the electronic library can retain many of these characteristics, except for access to the physical items. Electronic browsing can provide something not feasible in the physical library: multiple orderings of the items. We're not restricted to ordering strictly by call number: we can think of browsing by author or date. (One could imagine a library where the three-dimensional arrangement has significance; for example, the floor of an item could correspond to year of publication or publication type. However, the particulars of the three-dimensional arrangement are usually only accidental, as shelves are usually linearly organized by call number.)

Before computers, card catalogs offered some of this facility: there were often card catalogs organized by author, subject, and title. It would be unusual, however, to find a card catalog devoted to organizing items by publication date or date of acquisition.

Electronic browsing can overcome some of the limitations of the paper card catalogs. Electronic browsers can be much faster to use (as a patron can sit at a terminal, perhaps in his own office, rather than moving around the physical catalog). Browsers can provide greater flexibility in manipulating the items. For example, it is possible to search on both authors and years, and to present the items in a variety of ways (e.g., sorted by publication date).

3 SortTables: A System for Browsing

SortTables is an interface metaphor developed to support browsing. Information retrieval systems have often lacked a browser, thus forgoing the benefits of this type of interaction, or they have provided a simple electronic realization of the card catalog, making little use of the manipulative capabilities of the computer.

SortTables is not in any sense a complete solution to the problems of digital libraries. It only addresses the problem of presenting and manipulating a set of items; as presented here, its search facilities are limited. It doesn't address issues of how documents are stored or retrieved, protocols, security, or the host of other issues that must be addressed by any full system.

In SortTables, all items are presented in a table, where each row corresponds to a record, and each column to an attribute of that record. At any point in time, one of the columns is marked as the sort key, and all rows are sorted according to it.

There are four essential functions that can be performed on a SortTables table: movement through the table, sorting the items according to one of the attributes, searching for a particular attribute value, and restricting the items according to a range of values of some attribute. The view is updated after each keystroke. The up- and down-arrows and page keys will move by line or page, the left- and right-arrows change the sort key column, alphanumeric characters search incrementally (as each character is typed), and a single keystroke can delete items not in a desired range.

Figure 1: The SortTables System

Figure 1 shows the system running on Envision's data.
(Envision is a database of computer science literature under construction at Virginia Tech [7] [8].) Note that the current line and the sort column are highlighted; items are sorted according to values in that highlighted column. (The highlighted column appears bold or dark gray on the screen; the '=' divider under the column title marks it as well.) If there were an active search string, it would appear after the text 'Find:'. There is a progress indicator showing approximately where the current line is relative to the list of items (here, 98% through a list of approximately 100K items).

With a few keystrokes, we could restrict our browsing to sets of items such as: "Journal articles only, authored by Knuth, sorted by year."

3.1 Example

Suppose we have the columns Title, Journal, Year, Type, Author, and Terms. To browse items for which "Journal is 'IBM Journal of Research and Development' and Year ≥ 1980, sorted by Author," we could proceed as follows.

Initially, title is the first column. Press the right arrow key to sort by journal name, then type 'ibm j' to move to the entry for that journal. Typing '=' will restrict the table to those items that match the search string, with results shown in Figure 2. Next we can move to the year column, and type '198<' to delete items before 1980, with results shown in Figure 3. Pressing the right-arrow key twice will sort the remaining items by author, with results shown in Figure 4. Notice that we are seeing the same item (the one authored by Adams) in its neighborhood according to the new sort column.

Figure 2: Restricted to 'IBM J'

Figure 3: Journal = 'ibm j' and Year ≥ 1980

Figure 4: Journal = 'ibm j' and Year ≥ 1980; sorted by Author

So, with about a dozen keystrokes, we have made a moderately complex restriction, and can browse a useful subset of our data.

4 Design Goals and Early Implementation

The SortTables system was designed with a number of goals in mind:

• Simple metaphor with a high degree of interactivity
• Use of ranges and multiple orderings
• Ability to view the whole database
• Integrated browsing and searching
• Progressive utility
• Engender a sense of progress

The central metaphor is an automatically sorting table that can have parts deleted based on attribute values. The idea of reacting to each keystroke has been present from early on: rather than a command being typed, or a form being filled in while the system passively records keystrokes, this system is active while each key is typed.

An electronic replica of a card catalog might allow for range restriction on a single attribute. This metaphor extends that capability with something real card catalogs can't do: easily sort themselves based on several attributes, and allow restrictions based on them.

The ability to systematically view the whole database is a characteristic of many browsers. SortTables supports that by allowing one to page through all items based on any of the attributes.

Many systems have distinct modes for searching and for browsing: in search mode, a query retrieves a set of documents; in browse mode, entities (or attribute values) are listed. Some systems, such as the Computing Archive [1], integrate both modes, allowing a user to browse an author list while forming a query. The SortTables system is similar: the user is always browsing, but can form queries while doing so.

The system allows for progressive levels of utility: movement through the list, searching, sorting, and range restriction provide increasing levels of sophistication.
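The central metaphor (a table that re-sorts itself and can have parts deleted by attribute value) can be illustrated with a small in-memory sketch. This is not the SortTables implementation: the class name `SortTable`, its method names, the choice to compare values as lower-cased strings, and the four sample rows are all assumptions made for illustration. The calls at the end replay the keystrokes of the Section 3.1 example.

```python
from bisect import bisect_left


class SortTable:
    """Toy table that is always sorted by one column (hypothetical API)."""

    def __init__(self, columns, rows):
        self.columns = list(columns)
        self.rows = [tuple(r) for r in rows]
        self.sort_col = 0   # index of the current sort key column
        self.cursor = 0     # index of the current line
        self.query = ""     # active incremental search string
        self._resort()

    def _key(self, row):
        # Assumption: values are compared as lower-cased strings.
        return str(row[self.sort_col]).lower()

    def _resort(self, keep=None):
        self.rows.sort(key=self._key)
        # Stay on the same record so it is seen in its new neighborhood.
        self.cursor = self.rows.index(keep) if keep in self.rows else 0

    def sort_by(self, name):
        """Like the left/right arrows: choose a new sort key column."""
        current = self.rows[self.cursor] if self.rows else None
        self.sort_col = self.columns.index(name)
        self.query = ""
        self._resort(keep=current)

    def find(self, prefix):
        """Incremental search: move to the first row whose key is >= prefix."""
        self.query = prefix.lower()
        keys = [self._key(r) for r in self.rows]
        pos = bisect_left(keys, self.query)
        self.cursor = min(pos, max(len(self.rows) - 1, 0))

    def keep_matching(self):
        """Like '=': keep only rows whose key starts with the search string."""
        self.rows = [r for r in self.rows if self._key(r).startswith(self.query)]
        self.cursor = 0

    def delete_before(self):
        """Like '<': delete rows that sort before the search string."""
        keys = [self._key(r) for r in self.rows]
        self.rows = self.rows[bisect_left(keys, self.query):]
        self.cursor = 0


# Hypothetical sample rows (Title, Journal, Year, Type, Author, Terms).
IBM = "IBM Journal of Research and Development"
table = SortTable(
    ["Title", "Journal", "Year", "Type", "Author", "Terms"],
    [
        ("Paper A", IBM, 1979, "article", "Baker", ""),
        ("Paper B", IBM, 1984, "article", "Adams", ""),
        ("Paper C", "Communications of the ACM", 1985, "article", "Clark", ""),
        ("Paper D", IBM, 1981, "article", "Young", ""),
    ],
)

table.sort_by("Journal")   # right arrow
table.find("ibm j")        # type 'ibm j'
table.keep_matching()      # type '='
table.sort_by("Year")      # move to the year column
table.find("198")          # type '198'
table.delete_before()      # type '<'
table.sort_by("Author")    # right arrow twice
print([(row[4], row[2]) for row in table.rows])
```

Re-sorting a Python list on every column change is only reasonable for small tables; the thread file described in Section 6 exists precisely because this naive approach does not scale to the data sets the paper targets.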
Finally, the system tries to engender a sense of progress in its use. This metaphor lets the user systematically restrict a multi-dimensional space by dealing with one dimension at a time. Actions either "look around" or they close in on a particular area of the data.

The system has gone through several versions, with two different concerns: exploring the interface, and developing the underlying implementation. Here is how the successive versions have addressed these issues:

1. Spreadsheet-based prototype. The first version was built on a spreadsheet, by adding buttons that would sort according to the various columns, and using native spreadsheet facilities for searching and scrolling.
2. Graphical user interface on a NeXT. This version was a prototype designed using the NeXT Interface Builder, and was used to explore implementation techniques in a graphical environment.
3. VT100 interface, initial version. The first VT100-based version was developed to provide a stable interface portable across a variety of platforms. This version has been used as a testbed for formative evaluation of the interface.
4. VT100 interface, current version. The current version has incorporated a number of improvements suggested from formative evaluation, and has been used as a testbed for data structures capable of supporting large data sets.

5 Formative Evaluation

A formative evaluation of the first VT100-based version was performed, to improve the system before over-committing to a particular interface design. The evaluation took two forms: an interface designer (a Ph.D. student in the area of human-computer interaction) critiqued it, and two students served as users in a usability test. At the time, the system performed reasonably well with up to about five thousand items. (The worst case for sorting was less than ten seconds; other operations were faster.) The results of this evaluation were used in developing the current version.
The interface critic suggested adding redundant visual coding to indicate the current sort column and a progress indicator showing about how many items were left. He helped tune some of the assignments of functions to keys. Finally, he encouraged future development of a graphical version.

The students in the usability test worked with three different data sets: directory information resulting from the Unix ls -l listing command (25 items in 7 columns), data extracted from the CIA world fact database (250 items in 6 columns), and a collection of library catalog records (about 5000 items in 5 columns). These were selected to test the system on a variety of types of data sets.

The core of the interface was retained, but several changes were made to ease interaction. Page-up and -down keys were added, to improve navigation. Two functions were removed: undo, and the deletion of items not equal to the pattern. The training material was modified to put increased emphasis on the notion of the current sort column. Finally, sorting performance was identified as a potential bottleneck.

Currently, the system supports on the order of a million records, so the interface evaluation should be revisited with this larger data set.

6 Data Structure

To support the interface, a new data structure is being developed: the thread file. Its key idea is to use bounds information for subsets of the data to avoid searching many records.

6.1 The Basic Idea

The bounds of a set of records is a mapping from each attribute to its minimum and maximum values in that set. Suppose we have three records A, B, and C, with the values <3,4,5>, <0,3,4>, and <2,2,2> respectively. The bounds are: <(0,3), (2,4), (2,5)>. This says that the first attribute ranges from 0 to 3, the second from 2 to 4, and the third from 2 to 5.

There is a complete binary tree of bounds information for each attribute. Figure 5 shows an example of one such tree (for the first attribute). For simplicity, the tree shows a data set with only two attributes. Nodes are labeled with the bounds information for each attribute, in the form: <(low1, high1), (low2, high2)>.

<(1,8), (1,8)>
    <(1,4), (1,6)>
        <(1,2), (1,5)>
            <(1,1), (1,1)>
            <(2,2), (5,5)>
        <(3,4), (4,6)>
            <(3,3), (4,4)>
            <(4,4), (6,6)>
    <(5,8), (2,8)>
        <(5,6), (2,8)>
            <(5,5), (8,8)>
            <(6,6), (2,2)>
        <(7,8), (3,7)>
            <(7,7), (7,7)>
            <(8,8), (3,3)>

Figure 5: Bounds information tree for the first attribute (children are indented under their parent)

There is a "known bad" bit associated with each node of the tree, set whenever the (sub-)tree is known not to contain any possibly useful children. To evaluate a query, the tree for the desired ordering is used. A tree walk compares the nodes to the query:

• If the node is marked "known bad," return failure.
• If the node is a record (i.e., a leaf) and its bounds are valid, return that record.
• If the bounds don't intersect the query, mark the node "known bad" and return failure.
• Otherwise, walk both children. If upon returning, both children are marked "known bad," then mark this node "known bad" before returning failure.

The "known bad" bit is set along the way, so a node will not be evaluated again if it has already failed. The nature of the interface ensures that succeeding queries will be tighter than their predecessors, so bad nodes won't suddenly become useful again.

For example, suppose we use the tree to locate items according to the range restriction <(4,8), (7,8)>. The top node intersects that range (of course), so we continue down the tree. Moving down, the node labeled <(1,4), (1,6)> does not intersect the query range, so we mark it "known bad" and ignore its children. On the other side of the tree, <(5,8), (2,8)> intersects the range, as do both its children, <(5,6), (2,8)> and <(7,8), (3,7)>. We will examine the leaf nodes: <(5,5), (8,8)> is good, <(6,6), (2,2)> is not, <(7,7), (7,7)> is good, and <(8,8), (3,3)> is not. The good nodes are returned as valid records, and the bad ones are marked "known bad."
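The bounds computation and the tree walk described above can be sketched in a few lines. This is an in-memory illustration under stated assumptions, not the thread file itself: the names `bounds_of`, `intersects`, `Node`, `build`, and `walk` are hypothetical, the tree is built by halving a list already sorted on the first attribute, and the intersection test is performed before the leaf test so that a leaf's bounds are checked against the query. The sample data are the eight leaves of Figure 5 and the query <(4,8), (7,8)> from the worked example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


def bounds_of(records):
    """Map each attribute to its (min, max) over a set of records."""
    return tuple((min(col), max(col)) for col in zip(*records))


def intersects(a, b):
    """True if two bounds overlap on every attribute."""
    return all(lo1 <= hi2 and lo2 <= hi1
               for (lo1, hi1), (lo2, hi2) in zip(a, b))


@dataclass
class Node:
    bounds: Tuple[Tuple[int, int], ...]
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    record: Optional[Tuple[int, ...]] = None   # set only on leaves
    known_bad: bool = False


def build(records):
    """Build a bounds tree over records already sorted on the tree's attribute."""
    if len(records) == 1:
        return Node(bounds_of(records), record=records[0])
    mid = len(records) // 2
    return Node(bounds_of(records), build(records[:mid]), build(records[mid:]))


def walk(node, query, out):
    """Collect matching records in `out`; return False on failure."""
    if node.known_bad:
        return False
    if not intersects(node.bounds, query):
        node.known_bad = True        # this subtree can never match a tighter query
        return False
    if node.record is not None:      # a leaf whose bounds are valid
        out.append(node.record)
        return True
    found_left = walk(node.left, query, out)
    found_right = walk(node.right, query, out)   # always walk both children
    if node.left.known_bad and node.right.known_bad:
        node.known_bad = True
    return found_left or found_right


# Records A, B, C from the text.
print(bounds_of([(3, 4, 5), (0, 3, 4), (2, 2, 2)]))

# The leaves of Figure 5, sorted by the first attribute.
records = [(1, 1), (2, 5), (3, 4), (4, 6), (5, 8), (6, 2), (7, 7), (8, 3)]
root = build(records)
hits = []
walk(root, ((4, 8), (7, 8)), hits)
print(hits)
```

Because the bits persist between walks, a later and tighter query such as <(4,8), (8,8)> skips the whole left half of the tree without re-testing it. The sketch relies on the same assumption the paper states: queries only ever get tighter, so a "known bad" node never needs to be cleared.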
Figure 5 graphically shows how each parent's box is the bounding box of its two children. Notice also how successive levels of the tree split the box based on the first attribute.

6.2 Refined Data Structure

As described, this structure makes great demands on main memory. To reduce these demands, the "known bad" bits are kept in memory, while the rest of the tree is demand-paged from disk.

To reduce space, we want to avoid storing layers of the tree which provide us little information. For example, upper layers of the bounds tree aren't useful: they tend to enclose almost all of the data.

For each attribute, we keep a list of the records, in the order they appear when sorted by that attribute. For the records A, B, and C above, the lists are: <B,C,A> (when sorted by the first attribute), <C,B,A> (the second), and <A,B,C> (the third). When there are no restrictions outside the current sort order, this enables straightforward display of the records in order.

These lists are divided into buckets, whose size is chosen to be convenient relative to the disk block size.

The current implementation maintains only three layers of the tree: the level for which bounds information corresponds to a disk block (64 records on our system), the level for buckets of B records, and the bounds information for individual records. The lower layers of the tree provide very fine-grained information, but they are so numerous that they are expensive to track. (The Envision data set has about 100K items; this means 100K position records, plus 10K bounds records per attribute. If the full tree were stored, it would require an additional 90K bounds records for each attribute.)

Table 1 shows the space requirements of this leaner data structure. Items whose location is "memory" reside permanently in main memory; "disk" items are loaded on demand and cached. Table 2 shows these requirements for the two large data sets we have loaded.

Type of data                  Bytes       Location
Slice of bounds tree          8K^2 N/B    memory
Bound bits ("known bad")      KN/(8B)     memory
Leaf layer of bounds tree     4KN         disk
Record bits ("known bad")     KN/8        memory
Buckets                       4KN         disk

Table 1: Space required by the thread file. K is the number of columns, N the number of records, B is the number of records per bucket. Each index is assumed to take 4 bytes.

Data Set                      Envision    Library
Records (N)                   100,000     1,000,000
Columns (K)                   6           5
Records/Bucket (B)            16          64
Record space (disk)           22M         167M
Thread file space (disk)      7M          47M
Memory required               2M          4M

Table 2: Actual space overhead for thread file (rounded to the nearest megabyte). "Memory required" does not include cache space for information on disk.

6.3 Limitations

This data structure is undergoing refinement and tuning. We plan to add a small auxiliary index in the style of Glimpse [9] to support regular expression queries.

The thread file data structure does not have uniform performance for all queries. When there are very few records left, there can come a point where it spends more time looking at all the records that don't match the query than at finding those that do. For the Envision data set, this was not a problem, but for the library data set the delay became noticeable when there were fewer than about 5000 records. This problem is solved by switching representations when result sets become small.

7 Current Status

The system has been loaded with the two large data sets described in Table 2 above. The Envision data set consists of bibliographic data extracted from the Envision data base. The library data set consists of information extracted from the university's online card catalog.

We have not formally assessed performance, but the following figures should give an idea of the interaction, for the system running on a DEC Alpha workstation with either database. Moving by line or page feels instantaneous (the system appears to spend all its time updating the screen). Changing the sort column happens almost as quickly. Searching by typing characters takes up to about two or three seconds. Finally, restricting items by range takes about five to fifteen seconds. These times should improve with further algorithm development and performance tuning, but they suffice to give the system a highly interactive feel.

8 Related Work

Chang and Rice [3] survey browsing from a variety of perspectives. They distinguish browsing from other types of information seeking, and develop a taxonomy of dimensions along which to analyze browsing: contextual, behavioral, motivational, cognitive, and resource-based. The work reported here fits best with what they call the library, information science, and information retrieval traditions: browsing for unplanned discovery and as a problem-solving technique.

Thompson and Croft [12] address the benefits and requirements of browsing, and point out that browsers need a rich set of links, a flexible user interface, and search capabilities. They have a graphical system in which document and concept neighborhoods can be browsed. Crouch et al. [5] discuss a similar system focused on browsing clusters of documents.

Development of the SortTables interface took place in the context of the MARIAN [6] and Envision [7] systems at Virginia Tech. MARIAN uses a vector model, and provides simple lists of ranked results, but lacks a browsing capability. Envision uses MARIAN's search engine, but adds an unusual tool for viewing results: a "bubble view" of icons, laid out in a flexible two-dimensional arrangement. Envision is limited in browsing abilities: although the display is flexible, it only supports viewing at most a few hundred items from a previously made query. The SortTables interface simplifies Envision's results display to a textual list, but provides flexible ordering, range restriction, and direct access to all the data.

The data structure to support the interface has two aspects: orthogonal range query, and sorting. ("Orthogonal range query" refers to queries of the form "$\forall i : \mathit{min}_i \le \mathit{val}_i \le \mathit{max}_i$" as required by the interface.) Many researchers have proposed data structures to support such queries, e.g., k-d trees [2] [13], filtering search [4], and grid files [10]. Unfortunately, none of these address the requirement for sorting. The grid file uses a large multi-dimensional array to store buckets of data, indexed by a set of one-dimensional scales. The others are tree-based. Our data structure was directly inspired by the grid file, although it resembles the others in using trees.

9 Future Directions

Several aspects of this system are targeted for work over the next months:

• Improved algorithms and data structures. The system has reasonable performance on a DEC Alpha workstation, but the performance could be improved. It would be nice to have an algorithm that supported incremental additions to the database, and that allowed multiple simultaneous users in an efficient way. Several users could share trees, but they would need separate "bad" bits and other information.
• Other application areas. SortTables is well-suited for data sets that have a moderate number of interesting attributes. In addition to the bibliographic, geographic, and directory data sets mentioned above, we have explored using it with data from e-mail and calendar applications. These data sets share the characteristic that rapid sorting and range restriction based on attribute values provide useful manipulations.
• Networked interface. The current system is monolithic. It would be useful to split it into a networked client-server implementation, to allow broader use and to facilitate exploration of the thread file data structure with multiple concurrent users.
• Graphical interface. While the early prototypes used a graphical interface, later versions have evolved away from that. The graphical interface would have radio buttons as attribute labels (to choose the sort column) and a scroll bar for navigation.
It would retain the single-window design, keeping the search field on the window with the browsing list. A Tcl/Tk version [11] is currently being designed.
• Evaluation. To assess the utility of this approach, it should be evaluated in real use. With the library card catalog data loaded, we hope the system will be useful to a broad class of users, so that it can be effectively evaluated.

10 Acknowledgements

This work was supported in part by the National Science Foundation through CISE Institutional Infrastructure (Education) Grant CDA-9312611. It has benefited from the criticism and discussion of several colleagues, especially Dr. Kevin Mayo of SAIC.

References

[1] ACM. Computing Archive: Bibliography and Reviews from ACM. New York, NY: ACM Press, 1991.
[2] J. L. Bentley. "Multidimensional Binary Search Trees used for Associative Searching," Communications of the ACM, 18(9), 1975, pp. 509-517.
[3] Shan-Ju Chang and Ronald E. Rice. "Browsing: A Multidimensional Framework," in Martha E. Williams, ed.: Annual Review of Information Science and Technology (ARIST), Volume 28, 1993.
[4] B. Chazelle. "Filtering Search: A New Approach to Query Answering," SIAM Journal on Computing, 15(3), 1986, pp. 703-724.
[5] Donald B. Crouch, Carolyn J. Crouch, and Glenn Andreas. "The Use of Cluster Hierarchies in Hypertext Information Retrieval," in Hypertext '89 Proceedings, Pittsburgh, PA, ACM, 1989, pp. 225-237.
[6] Edward A. Fox, Robert K. France, Eskinder Sahle, Amjad Daoud, and Ben E. Cline. "Development of a Modern OPAC: From REVTOLC to MARIAN," SIGIR '93: Proceedings of the Sixteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Pittsburgh, PA, ACM, 1993, pp. 248-259.
[7] Edward A. Fox, Deborah Hix, Lucy T. Nowell, Dennis J. Brueni, William C. Wake, Lenwood S. Heath, and Durgesh Rao. "Users, User Interfaces, and Objects: Envision, a Digital Library," Journal of the American Society for Information Science, 44(8), 1993, pp. 480-491.
[8] L. Heath, D. Hix, L. Nowell, W. Wake, G. Averboch, and E. Fox. "Envision: A User-Centered Database from the Computer Science Literature," Communications of the ACM, 38(4), Apr. 1995, pp. 52-53.
[9] U. Manber and S. Wu. "GLIMPSE: A Tool to Search Through Entire File Systems," Technical Report No. TR 93-34, University of Arizona Department of Computer Science, 1993.
[10] J. Nievergelt, H. Hinterberger, and K. C. Sevcik. "The Grid File: An Adaptive, Symmetric, Multi-Key File Structure," ACM Transactions on Database Systems, 9(1), 1984, pp. 38-71.
[11] John Ousterhout. Tcl and the Tk Toolkit. Reading, MA: Addison-Wesley, 1994.
[12] R. H. Thompson and W. B. Croft. "Support for browsing in an intelligent text retrieval system," International Journal of Man-Machine Studies, Volume 30, 1989, pp. 639-668.
[13] D. E. Willard. "New Data Structures for Orthogonal Range Queries," SIAM Journal on Computing, 14(1), 1985, pp. 232-253.