SWSE: Answers Before Links!

Andreas Harth, Aidan Hogan, Renaud Delbru, Jürgen Umbrich, Sean O'Riain, and Stefan Decker
Digital Enterprise Research Institute, National University of Ireland, Galway
[email protected]

Abstract. We present a system that improves on current document-centric Web search engine technology; adopting an entity-centric perspective, we are able to integrate data from both static and live sources into a coherent, interlinked information space. Users can then search and navigate the integrated information space through relationships, both existing and newly materialised, for improved knowledge discovery and understanding.

1 Introduction

Today's best-in-class search engines, such as Google, Yahoo and MSN, offer search over web documents. Paramount to their success is their data-gathering prowess and their ability to let consumers quickly and efficiently find documents matching a set of keywords. Results in such systems are simply links to documents; users must manually traverse these documents to find the answers to their information needs. Ranking tries to present relevant documents first, but little effort is made to enhance, extract or integrate data to provide precise answers to questions such as "who are the friends of Rudi Studer?", "what organism pertains to the protein SHNF1?", or "where is SK Telekom located?". In addition, most current search engines offer little support for disambiguation or refinement of results: adroit keyword query construction and reformulation is required to avoid trawling through a quagmire of irrelevant results.

More recent developments in the search space have seen Vivisimo (http://vivisimo.com/) offer combined search and clustering capabilities: search is initially driven by keywords, but results can also be filtered using generated clusters which provide their "context". Quintura (http://quintura.com/) also offers dynamic cluster creation based on keyword queries, with analysis of the contextual relationships between keywords; the result is a visual semantic keyword cloud that the user may call upon to further filter the result set. However, all of these tools are still based upon the traditional document-centric view of knowledge rather than on real-world entities and their relationships. On the other hand, search engines operating on structured data sources, such as A9's Web 2.0 offering (http://a9.com/), only merge information visually at the syntactic level; true integration at the data level is not attempted.

Broaching these key issues, SWSE (Semantic Web Search Engine) performs semantic integration of structured data: not only from the Web but also from monolithic data sources such as XML database dumps, large static datasets and even live sources. This is achieved using a hybrid data integration solution which amalgamates the data warehousing and on-demand approaches to integration. From this integration emerges a large graph of RDF entities with inter-relations and structured descriptions of entities: archipelagos of information coalesce to form a coherent knowledge base. This perspective of knowledge is a better reflection of its subject than the traditional document-centric philosophy. Entities are typed according to what they describe: people, locations, organisations and publications as well as documents; entities have specified relations to other entities: people know other people, people author documents, organisations are based in locations, and so on.
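As a minimal sketch of what such an entity-centric description looks like in RDF (illustrative only, not SWSE's own code; the rdflib library, the example namespace and all names are assumptions), the following Python snippet builds a tiny FOAF-style graph of typed, interlinked entities:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, FOAF

EX = Namespace("http://example.org/")  # hypothetical namespace

g = Graph()
g.bind("foaf", FOAF)

# Entities are typed according to what they describe...
g.add((EX.alice, RDF.type, FOAF.Person))
g.add((EX.bob, RDF.type, FOAF.Person))
g.add((EX.paper1, RDF.type, FOAF.Document))

# ...and carry attributes as well as relations to other entities.
g.add((EX.alice, FOAF.name, Literal("Alice Example")))
g.add((EX.alice, FOAF.knows, EX.bob))    # people know other people
g.add((EX.alice, FOAF.made, EX.paper1))  # people author documents

print(g.serialize(format="turtle"))
```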
Since the entity-centric model closely reflects the real (and online) world, it becomes viable to develop a search and query engine which users will find intuitive to use. Users initially pose a keyword search to home in on relevant entities; results can then be refined according to type; users can then navigate to and from entities through known relations. Thus, rich descriptions of diverse entities are easily retrievable within the interface. Where the required data is not available as structured data, SWSE bridges the gap to traditional search by offering links to documents which are related to the entity. As we will see, SWSE, by effecting such an entity-centric modus operandi, can truly offer Answers before Links.

2 Example Session

SWSE's user interaction model depends on an entity-centric view of the world: the cognitive model assumes entities with attributes, and relations between entities. The user interface primitives for operating in this space are: keyword matching in attributes, filtering of results by entity type, and navigation of relations between entities. In the following, we describe a use case that shows how SWSE outperforms current web search engines in terms of information search potential. SWSE's entity-centric interface extends current search functionality and assists the user's tasks along two dimensions:

– navigational: reaching a specific entity and exploring its surrounding entities;
– informational: acquiring an extensive representation of an entity.

The use case we introduce is focused on the task of learning more about a person: Rudi Studer. We examine how a user can procure, with minimal effort, an extended description of a person coming from a multitude of sources and describing not only his personal information (contacts, interests, work environment) but also his surrounding entities (people he knows, his projects, his organisation). Please note that the interface, although demonstrated on browsing a social network, is domain-independent and can similarly be used to navigate information spaces in other domains.

In a common search engine, gathering information about an entity can be a cognitively intensive activity. For example, a user will enter the person's name as a keyword query and will get a list of web pages as results. The user must then browse and review several web pages and manually filter the information, possibly reformulating the keyword search with extra terms in order to find new pages and eliminate irrelevant ones. To gather and assemble a coherent description of the person, current search engines can demand needless user energy and time.

With SWSE, a user will start, as with a common search engine, by entering the keyword query "rudi studer". The result of the search is shown in Figure 1: a list of all entities matching the keyword (area 1 in the figure), each accompanied by a small summary description.

Fig. 1. Results list page after typing in a keyword query.

The entities often have various types, such as Person, Document or Professor (2 in the figure). In order to refine the search, the user can click on "Person" to filter the results by entity type and get only a list of Person entities. If users click on the first result (the Person Rudi Studer), they are presented (as captured in Figure 2) with a detailed information page about the person.
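The interface primitives above (keyword matching on attributes, filtering by type) map naturally onto structured queries. As an illustrative sketch only — SWSE answers such requests through its own keyword and quad indices, not through this exact query — the following SPARQL query, run here with rdflib over a hypothetical aggregated dataset, retrieves entities of type foaf:Person whose name matches the keywords:

```python
from rdflib import Graph

g = Graph()
g.parse("http://example.org/data.rdf")  # hypothetical aggregated data

# Keyword match on an attribute (foaf:name), filtered by entity type.
q = """
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?person ?name WHERE {
  ?person rdf:type foaf:Person ;
          foaf:name ?name .
  FILTER regex(?name, "rudi studer", "i")
}
"""
for row in g.query(q):
    print(row.person, row.name)
```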
The information about the entity can be aggregated from multiple sources and is presented in a homogeneous view (in this example, nearly 200 sources contribute to Rudi's representation). Here, users can find Rudi's homepage, telephone and fax numbers, people he knows, and so on (3, Figure 2). Users can then continue their exploration of Rudi's surrounding environment just by following the semantic links.

Fig. 2. Entity page containing information about Rudi aggregated from multiple sources.

In addition to following outgoing links, users find an overview of incoming relations to the focus instance in a column on the left side of the pane (4). Users are shown that Rudi has authored 198 papers, that 48 people know him, and that he is the maker of a file and the editor of four things. SWSE also assists users by linking various RDF entities with named entities found in full-text information (cf. Section 3.2) via additional navigational "see also" shortcuts. For example, when the entity "Rudi Studer" is identified in a document, SWSE is able to instantly provide its full description and to suggest related and relevant documents.

3 Architecture

The architecture is an adaptation of search engine and database/data warehousing architectures. Figure 3 illustrates the high-level architecture and the data flow within the system.

Fig. 3. Semantic Web Search Engine architecture, consisting of a data preparation and integration phase and the Semantic Search and Query Engine.

3.1 Semantic Search and Query Engine

The core of SWSE is the Semantic Search and Query Engine as depicted in Figure 3. YARS2 is a scalable distributed architecture for indexing and querying large RDF datasets. It operates on a named graph data model, whereby RDF triples are extended with a context which encodes the source of the data, forming a quadruple (subject, predicate, object, context). Within YARS2, there are packages for local index creation and management, distributed query processing, and runtime ranking of results. The Index Manager creates and services lookups on local keyword and quad (named graph) indices. The Query Processor co-ordinates several Index Managers over the network and offers a SPARQL endpoint. ReConRank [2] is used to rank entities in the result set, providing metrics for the importance of particular entities and also for the trustworthiness of data sources; these metrics are used for ordering the presentation of results in the UI.

3.2 Data Preparation and Integration

In the following, we briefly describe the process of collecting and integrating data from a plethora of sources in a multitude of formats. In order to acquire our raw dataset, we employed MultiCrawler [1]. To demonstrate the use of real-world heterogeneous information sources with diverse ownership, we crawled native RDF sources, RSS streams, and HTML, MS Office, PS and PDF documents, with MultiCrawler extracting metadata and converting to RDF where necessary. Table 1 shows some statistics about the currently used dataset.

Description                 Value
Number of statements        250,298,954
Data size (uncompressed)    48 GB
Text index size             7.1 GB
Quad index size             23.4 GB

Table 1. Data size and index statistics.

We enrich and interlink the base dataset using entity consolidation (a.k.a. object consolidation) [3]. As seen, SWSE integrates data from multiple sources, and often, within RDF, different sources may describe the same entities, providing complementary data on a particular entity.
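A brief sketch of the quadruple model of Section 3.1 may help here (again using rdflib as a stand-in; YARS2 maintains its own native indices, and the sources and data below are hypothetical): each triple is stored together with the context it came from, so complementary statements about the same entity remain attributable to their sources.

```python
from rdflib import ConjunctiveGraph, Literal, Namespace
from rdflib.namespace import FOAF

EX = Namespace("http://example.org/")

g = ConjunctiveGraph()

# Two hypothetical sources contribute statements about the same entity;
# the context (fourth element of each quad) records where a triple came from.
src_a = g.get_context(EX.sourceA)
src_b = g.get_context(EX.sourceB)

src_a.add((EX.rudi, FOAF.name, Literal("Rudi Studer")))
src_b.add((EX.rudi, FOAF.homepage, EX.homepage))

for s, p, o, c in g.quads((None, None, None, None)):
    print(s, p, o, "from", c.identifier)
```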
When a common URI is used to identify the entity, data integration is automatic under the identifier; when URIs are not provided or do not match, we use entity consolidation to identify matches through analysis of the values of inverse functional properties. The goal here is to avoid having the knowledge contribution of an entity split over numerous instances: i.e., to have a 1:1 ratio of entities to results in the UI.

In addition, we perform entity linking to achieve the "see also" links: we use our set of entities from structured data sources as the crystallisation points around which metadata from poorly structured data sources (mostly HTML documents) are arranged. Web documents (HTML pages, RSS feeds) are mainly unstructured, but they are widespread on the Web and remain a useful source of information that should be leveraged. In order to enable users to locate relevant information from the abundance trapped in poorly structured data, we link web documents with the existing RDF entities. We first create an inverted index over the text of such documents. Then, for each RDF entity, we query the inverted index using one specific property; e.g., we use foaf:name for matching foaf:Person entities. Finally, for each query hit, we create an association between the web document and the RDF entity.

3.3 On-Demand Integration

Finally, the architecture provides for runtime querying of external live data sources. Wrappers can be plugged into the architecture for querying external sources: the wrappers provide the same interface as an Index Manager and so can be handled by the Query Processor. Each wrapper handles a particular format of external source (e.g., OpenSearch, SPARQL) and handles multi-threaded access to multiple such sources; the wrappers incorporate Squid caching. Please note that we have currently not enabled any wrappers for the demo.
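The wrappers themselves are not published with this paper; the following is a hedged sketch of the general pattern only (the class name, endpoint URL and query are all hypothetical). It issues a query to a remote SPARQL endpoint over the standard SPARQL protocol and exposes a lookup method that a query processor could call as if it were a local index:

```python
import urllib.parse
import urllib.request

class SparqlWrapper:
    """Sketch of a wrapper exposing a remote SPARQL endpoint behind
    the same lookup interface as a local index manager."""

    def __init__(self, endpoint):
        self.endpoint = endpoint  # hypothetical endpoint URL

    def query(self, sparql):
        # Standard SPARQL protocol: the query travels as an HTTP
        # parameter; results come back as SPARQL-Results XML.
        params = urllib.parse.urlencode({"query": sparql})
        req = urllib.request.Request(
            self.endpoint + "?" + params,
            headers={"Accept": "application/sparql-results+xml"},
        )
        with urllib.request.urlopen(req) as resp:
            return resp.read()

wrapper = SparqlWrapper("http://example.org/sparql")
print(wrapper.query("SELECT ?s WHERE { ?s ?p ?o } LIMIT 5"))
```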
4 Lessons Learnt

As one would expect, whilst designing, developing and implementing this architecture and its components, many lessons were learnt and numerous conclusions arrived at.

Our first observation concerned the importance of extending the RDF model with context. Under the RDF model, whereby community-driven knowledge bases are encouraged, anyone can say anything about any resource anywhere. Thus, tracking the source of data is vital to maintaining the integrity of the information provided to users. In fact, we created ReConRank on the premise that sources should also be ranked for a particular result set, offering metrics on trustworthiness and thus taking contextual information into account in the ranking procedure.

Secondly, we learnt the importance of extending our knowledge base with data from non-RDF sources. By only indexing RDF data, the sphere of knowledge indexed was quite limited; thus we created MultiCrawler to crawl and transform other data sources such as HTML and RSS, and started using entity linking to create the "see also" links.

Yet another aspect we discovered: it is sometimes infeasible to crawl large database-backed sites with millions of exported files, since harvesting the entirety of such a site at a rate of one page every ten seconds would take on the order of months. We see two alternatives to remedy this problem: either the sites provide data dumps of their datasets for download, or they provide a SPARQL interface which allows for on-demand integration of these sources. To be able to detect and utilise the data dumps automatically, we propose to use an extension of the sitemap protocol for semantic crawling (http://sw.deri.org/2007/07/sitemapextension/). In addition to pointing to data dumps, the sitemap extension allows SPARQL endpoints on the Web to be discovered.

Data quality is an issue, too. The native data we acquire is sparsely interlinked, since URIs often do not match up: sometimes agreement cannot be reached, other times a new URI is created in ignorance of pre-existing ones. Entity consolidation becomes a powerful tool in such a scenario and can dramatically improve the quality of the dataset. However, object consolidation can "incorrectly" consolidate instances referring to different entities. This can be attributed to two main factors. Firstly, dud values such as "N/A", "foo" or "ask" are often assigned to inverse functional properties; these values match across instances referring to different entities and cause incorrect consolidation. Secondly, properties are sometimes used in a manner contrary to their formal definition as inverse functional.

Perhaps the current dearth of URI agreement can be attributed to the absence of a site where data providers can view aggregated information and verify the integrity of their data files; we see SWSE as filling this niche and supporting the interlinking and reuse of identifiers for entities on the Web, i.e., SWSE in the short term can act as a reference tool for data providers. Thus, we hope that SWSE will help motivate the production of better-quality data.

In general, dealing with web data is more difficult than dealing with data provided and managed by an enterprise. We have achieved higher data quality, and thus improved browsing capabilities, in vertical search settings. Given a confined domain, it is possible to arrive at datasets of better inter-linkage and quality using a few high-quality sources. In particular, entity consolidation performs exceptionally well in cases where different data sources use different identifiers to denote entities (e.g., ticker symbol, CIK and CUSIP in the area of securities). The user interface presented here is domain-independent and depends only on a few selected RDF primitives (such as rdf:type and rdfs:label); however, in an enterprise setting, making the user interface domain-specific can facilitate much more powerful browsing and navigation functionality at the expense of generality.
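To make the consolidation pitfalls discussed above concrete, here is a minimal, assumption-laden sketch (illustrative data and blacklist; not SWSE's actual implementation): instances sharing a value for an inverse functional property such as foaf:mbox are merged onto one identifier, while a blacklist guards against the "dud" values mentioned above.

```python
from collections import defaultdict

# Illustrative data: (instance id, inverse-functional-property value).
statements = [
    ("ex:rudi1", "mailto:rudi@example.org"),
    ("ex:rudi2", "mailto:rudi@example.org"),  # same mbox -> same entity
    ("ex:other", "N/A"),                      # dud value, must not merge
    ("ex:other2", "N/A"),
]

DUD_VALUES = {"N/A", "foo", "ask"}  # junk values observed in web data

def consolidate(statements):
    """Map each instance id to a canonical id, merging instances that
    share a non-dud value for an inverse functional property."""
    by_value = defaultdict(list)
    for inst, value in statements:
        if value not in DUD_VALUES:
            by_value[value].append(inst)
    canonical = {}
    for insts in by_value.values():
        rep = min(insts)  # pick one representative identifier
        for inst in insts:
            canonical[inst] = rep
    return canonical

print(consolidate(statements))
# ex:rudi1 and ex:rudi2 collapse to one identifier; the "N/A" instances do not.
```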
5 Conclusion

We have described the application of semantic web technologies to the scenarios of entity-centric search and navigation and large-scale web data source integration. The current system is available online (http://swse.deri.org/), and we also provide an experimental SPARQL endpoint (http://swse.deri.org/yars2/). The SWSE architecture features components to crawl, transform, enhance, integrate and index data from a plethora of sources and formats, and to provide advanced querying and browsing over it. SWSE reuses hundreds of thousands of RDF sources on the Web and assumes a completely open world: new RDF data can easily be integrated without any change to the architecture or user interface. In summary, we provide a complete end-to-end system for advanced web search. Of interest to commercial parties is SWSE's capability to provide enhanced and accurately linked information spaces, covering both Web and intranet document and data repositories, which can then be used for domain-specific browsers in a manner that current search engines do not allow. Particular attention is paid to scale throughout the architecture.

We still have many data sources to exploit, and we foresee both the quality and quantity of RDF data and live data sources increasing; we look forward to such developments and foresee our system scaling accordingly. We are currently implementing additional OWL reasoning primitives that go beyond the ground equality reasoning required for entity consolidation. Thus, we hope to continue to improve and to provide Answers before Links in a new generation of web search.

References

1. A. Harth, J. Umbrich, and S. Decker. MultiCrawler: A Pipelined Architecture for Crawling and Indexing Semantic Web Data. In Proceedings of the 5th International Semantic Web Conference, pages 258–271, 2006.
2. A. Hogan, A. Harth, and S. Decker. ReConRank: A Scalable Ranking Method for Semantic Web Data with Context. In Proceedings of the 2nd Workshop on Scalable Semantic Web Knowledge Base Systems, 2006.
3. A. Hogan, A. Harth, and S. Decker. Performing Object Consolidation on the Semantic Web Data Graph. In Proceedings of the 1st I3: Identity, Identifiers, Identification Workshop, 2007.