Academia.eduAcademia.edu

Querying Datasets on the Web with High Availability

2014, Lecture Notes in Computer Science

As the Web of Data is growing at an ever increasing speed, the lack of reliable query solutions for live public data becomes apparent. sparql implementations have matured and deliver impressive performance for public sparql endpoints, yet poor availability-especially under high loads-prevents their use in real-world applications. We propose to tackle this availability problem by defining triple pattern fragments, a specific kind of Linked Data Fragments that enable low-cost publication of queryable data by moving intelligence from the server to the client. This paper formalizes the Linked Data Fragments concept, introduces a client-side sparql query processing algorithm that uses a dynamic iterator pipeline, and verifies servers' availability under load.

Querying Datasets on the Web with High Availability Ruben Verborgh1 , Olaf Hartig2 , Ben De Meester1 , Gerald Haesendonck1 , Laurens De Vocht1 , Miel Vander Sande1 , Richard Cyganiak3 , Pieter Colpaert1 , Erik Mannens1 , and Rik Van de Walle1 1 Ghent University – iMinds, Belgium {firstname.lastname}@ugent.be 2 University of Waterloo, Canada [email protected] 3 Digital Enterprise Research Institute, nui Galway, Ireland [email protected] Abstract. As the Web of Data is growing at an ever increasing speed, the lack of reliable query solutions for live public data becomes apparent. sparql implementations have matured and deliver impressive performance for public sparql endpoints, yet poor availability—especially under high loads—prevents their use in real-world applications. We propose to tackle this availability problem by defining triple pattern fragments, a specific kind of Linked Data Fragments that enable low-cost publication of queryable data by moving intelligence from the server to the client. This paper formalizes the Linked Data Fragments concept, introduces a client-side sparql query processing algorithm that uses a dynamic iterator pipeline, and verifies servers’ availability under load. The results indicate that, at the cost of lower performance, query techniques with triple pattern fragments lead to high availability, thereby allowing for reliable applications on top of public, queryable Linked Data. Keywords: Linked Data, Linked Data Fragments, querying, availability, scalability, sparql 1 Introduction The past few years, the performance of sparql endpoints has increased steadily. In spite of all this progress, reliable queryable access to public Linked Data datasets largely remains impossible due to the low availability percentages of public sparql endpoints. As of end-2013, the average sparql endpoint is down for more than 1.5 days each month [4]. This means we cannot build reliable applications on top of queryable public data. No matter how fast sparql implementations become, if their availability does not increase, no one will take the risk of depending on public data providers to provide querying for their applications. Availability, not performance, is currently the main threat to the success of the Semantic Web as a viable technology for today’s challenges. Pre-print of a paper accepted to the International Semantic Web Conference 2014 (ISWC 2014). The final publication is available at link.springer.com. 2 Ruben Verborgh et al. To circumvent the availability issue, consumers who want to query public data typically download a data dump and host their own private sparql endpoint. While this seems to solve the issue, it has the following drawbacks: – Hosting an endpoint requires (possibly expensive) infrastructural support and involves (often manual) set-up and maintenance. – The data in the endpoint is not guaranteed to be up-to-date. – Each dataset required by any of the desired queries must be fully loaded into the endpoint, even if only a small part of that dataset is actually needed. Furthermore, querying a local machine can hardly be considered Web querying, as everything happens offline. Making the Semantic Web vision scalable by downloading and querying all data locally seems an unsatisfactory paradox. In order to advance towards a solution for high-availability Web querying, Linked Data Fragments (ldfs) [27] were proposed as a framework to analyze all possible ways of publishing parts of a Linked Data dataset, ranging from sparql endpoints with highly specific results to data dumps that contain the entire dataset. In particular, this framework allows to define specific types of fragments that can be generated with minimal effort by servers, while still enabling efficient client-side querying. One such type are triple pattern fragments (formerly called basic Linked Data Fragments [27]), which offer triple-patternbased access to a dataset. In this paper, we show that client-side query processing using triple pattern fragments allows live querying with high availability and scalability of public datasets. This result demonstrates that this enables reliable query execution on the Web of Data, with minimal server-side cost. First, the next section discusses related work on querying rdf-based datasets on the Web. We then provide a formalization of Linked Data Fragments in Section 3, followed by a client-side, iterator-based query execution algorithm in Section 4. Section 5 contains the availability evaluation and discussion. We conclude the paper in Section 6. 2 Related Work On the current Web, several http interfaces that provide access to triple-based data are available. We will discuss public sparql endpoints, Linked Data publishing, and other http interfaces for triples, as well as their querying methods. 2.1 Public sparql endpoints The sparql language [12] is the wc standard to query a collection of rdf triples [16]. Many triple stores, such as Virtuoso [5] and Jena tdb [11], offer a sparql interface. Even though current sparql interfaces offer high performance [3, 18, 22], individual queries can consume a significant amount of server processor time and memory. In fact, it has been shown that the evaluation problem for sparql is pspace-complete [20]. Like any high-performance database server, sparql servers with high demand are generally expensive to host, which is further complicated for public servers because of unpredictable loads. Pre-print of a paper accepted to the International Semantic Web Conference 2014 (ISWC 2014). The final publication is available at link.springer.com. Querying Datasets on the Web with High Availability 3 The current de-facto way for providing queryable access to triples on the Web is the sparql protocol [6]: clients send sparql queries through a specific http interface; the server executes these queries and responds with their results. This contrasts with the majority of machine-to-machine http interactions on the Web, where the server implements a rigidly structured api through which clients access the data. Such apis purposely limit the kind of queries a client can ask, as it allows those servers to place a bound on the computation time needed for each api request [27]. With sparql endpoints, clients can demand the execution of arbitrarily complicated queries1 [6]. Furthermore, since each client requests unique, highly specific queries, regular http caching mechanisms are ineffective, since they can only optimize repeated identical requests. These factors contribute to the low availability of public sparql endpoints, which is documented extensively [4]. It is important to note that this low availability is not the result of poor performance: as indicated by multiple benchmarks [3, 18, 22], many sparql implementations deliver very high performance. Instead, it is the consequence of the architectural decision of the current sparql protocol, which demands the server responds to highly complex requests [27]. This makes providing reliable public sparql endpoints an exceptionally difficult challenge, incomparable to hosting any other public http server. 2.2 Linked Data Perhaps the most well-known alternative interface to rdf triples is described by the Linked Data principles [2] which, not coincidentally, align with the Web’s architectural constraints [8]. The principles require servers to publish documents (“subject pages”) with triples about specific entities, which a client can access through their entity-specific uri, a practice which is called dereferencing. Each of these Linked Data documents contains triples that mention uris of other entities, which can be dereferenced in turn. Serving such documents is like serving html files, which does not require much processor time or memory, so hosting them at low cost is straightforward. Several Linked Data querying techniques [14] aim to use dereferencing to solve sparql queries over the Web of Data. This process happens client-side, so the availability of servers is not impacted. The Linked Data publishing and querying strategy has two main drawbacks. First, query execution times are high, and many queries cannot be solved (efficiently). For example, it is nearly impossible to directly answer the following seemingly simple query for any given dataset: SELECT ?person WHERE { ?person a <http://xmlns.com/foaf/0.1/Person> } A client could try to fetch the url http://xmlns.com/foaf/0.1/Person but, because of the Web’s unidirectional linking structure, the document at that url cannot possibly link to all instances of foaf:Person. In fact, it does not link to any, so the query execution yields an empty result. 1 Many endpoints allow to only expose a subset of all sparql queries, for instance, by limiting the allowed execution time. However, even under those circumstances, the availability of public sparql endpoints remains low [4]. Pre-print of a paper accepted to the International Semantic Web Conference 2014 (ISWC 2014). The final publication is available at link.springer.com. 4 Ruben Verborgh et al. Second, documents about an entity are looked up through dereferencing, and the uri of an entity only points to the single document on the server that hosts the domain of that uri. For example, the uri http://dbpedia.org/resource/ Barack_Obama leads to triples about Barack Obama on the dbpedia server, not to the triples hosted on other sources that also have data about Barack Obama, such as the bbc or the New York Times. And even though dbpedia could link to those sources, this is entirely up to the server’s discretion. While anybody can reuse the dbpedia uri to add triples about an entity, it is highly unlikely that those triples are considered by Linked Data querying. This contrasts with sparql endpoints, which can provide data about resources with any uri. 2.3 Other http interfaces to rdf triples Finally, several other http interfaces for triples have been designed. Strictly speaking, the most trivial http interface is a data dump, which is a single-file representation of a (part of a) dataset. As discussed in Section 1, this allows consumers to set up a private query endpoint. Typical http apis offer more granular access, albeit still far less flexible than sparql endpoints. The Linked Data Platform [23] is a read/write http interface for Linked Data, scheduled to become a wc recommendation. It details several concepts that extend beyond the Linked Data principles, such as containers and write access. However, the api has been designed primarily for consistent read/write access to Linked Data resources, not to enable reliable and/or efficient query execution. Another read/write interface is the sparql Graph Store Protocol [19], which describes http operations to manage rdf graphs through sparql queries. Additionally, several other fine-grained http interfaces for triples have been proposed, such as the Linked Data api [17] and Restpark [21]. Some of them aim to bridge the gap between the sparql protocol and the rest architectural style underlying the Web [28]. However, none of these proposals are widely used at the moment and no query engines for them are implemented to date. 3 3.1 Linked Data Fragments Concept and context What all of the above interfaces have in common is that, in one sense or another, they publish certain fragments of a Linked Data dataset. A sparql endpoint response, a Linked Data document, and a data dump each offer specific parts of all triples of a given collection. Rather than presenting them as fully distinct approaches, we uniformly call the result of each request to such interfaces a Linked Data Fragment (ldf) [25, 27]. As Fig. 1 shows, each kind of fragment mainly differs in its specificity. Depending on this, the workload to compute answers to queries is divided differently between clients and servers. The key to efficient and reliable Web querying is to find fragments that strike an optimal balance between client and server effort. Before we examine particular options, let us define formally what ldfs are. Pre-print of a paper accepted to the International Semantic Web Conference 2014 (ISWC 2014). The final publication is available at link.springer.com. Querying Datasets on the Web with High Availability data Linked Data triple pattern dump document fragments generic requests high client effort high server availability various types of Linked Data Fragments 5 sparql result specific requests high server effort low server availability Fig. 1: All http triple interfaces offer Linked Data Fragments of a dataset. They differ in the specificity of the data they contain, and thus the effort needed to create them. 3.2 Formal definitions As a basis for our formalization, we use the following concepts of the rdf data model [16] and the sparql query language [12]. We write U , B, L, and V to denote the sets of all uris, blank nodes, literals, and variables, respectively. Then, T = (U ∪ B) × U × (U ∪ B ∪ L) is the (infinite) set of all rdf triples. Any tuple tp ∈ (U ∪ V) × (U ∪ V) × (U ∪ L ∪ V) is a triple pattern. Any finite set of such triple patterns is a basic graph pattern (bgp). Any more complex sparql graph pattern, typically denoted by P , combines triple patterns (or bgps) using specific operators [12, 20]. The standard (set-based) query semantics for sparql defines the query result of such a graph pattern P over a set of rdf triples G ⊆ T as a set that we denote by [[P ]]G and that consists of partial mappings µ : V → (U ∪ B ∪ L), which are called solution mappings. An rdf triple t is a matching triple for a triple pattern tp if there exists a solution mapping µ such that t = µ[tp], where µ[tp] denotes the triple (pattern) that we obtain by replacing the variables in tp according to µ. For the sake of a more straightforward formalization, in this paper, we assume without loss of generality that every dataset G published via some kind of fragments on the Web is a finite set of blank-node-free rdf triples; i.e., G ⊆ T ∗ where T ∗ = U × U × (U ∪ L). Each fragment of such a dataset contains triples that somehow belong together; they have been selected based on some condition, which we abstract through the notion of a selector: Definition 1 (selector). A selector is a partial function s : 2T → {true, false}. A more concrete type of this abstract notion are triple pattern selectors, which select triples that match a certain triple pattern: Definition 2 (triple pattern selector). Given a triple pattern tp, the triple pattern selector for tp is the selector stp that, for any singleton set {t} ⊆ 2T , is defined by ( true if t is a matching triple for tp, stp ({t}) = false else. When publishing data on the Web, we should equip its representations with hypermedia controls [1, 8, 9]. We encounter them on a daily basis when browsing html pages; they are usually present as hyperlinks or forms. What all these controls have in common is that, given some (possibly empty) input, they result in our browser performing a request for a specific url. Definition 3 (control). A control is a function that maps from some set to U . Pre-print of a paper accepted to the International Semantic Web Conference 2014 (ISWC 2014). The final publication is available at link.springer.com. 6 Ruben Verborgh et al. In particular, we are interested in controls whose domain is a set of selectors, as they allow to create urls that correspond to data matching those selectors. By now, we have introduced all elements necessary to define fragments of an rdf-based dataset. Definition 4 (Linked Data Fragment). Let G ⊆ T ∗ be a finite set of blanknode-free rdf triples. A Linked Data Fragment (ldf) of G is a tuple f = hu, s, Γ, M, Ci with the following five elements: – u is a uri (which is the “authoritative” source from which f can be retrieved); – s is a selector; – Γ is a set consisting of all subsets of G that match selector s, that is, for every G′ ⊆ G it holds that G′ ∈ Γ if and only if G′ ∈ dom(s) and s(G′ ) = true; – M is a finite set of (additional) rdf triples, including triples that represent metadata for f ; and – C is a finite set of controls. Any source of rdf-based data on the Web can be described as an ldf by specifying the corresponding values for u, s, Γ , M , and C. For example, the result of a sparql CONSTRUCT query is an ldf where the selector is the query, the metadata set is empty, and the control set contains a sparql endpoint url [6]. Informally, we distinguish different types of ldfs, each of which represents ldfs that have the same type of selector and the same kind of conditions on their metadata M and on their controls C. Section 3.3 will show a specific ldf type. Some ldfs can be quite large; for instance, a data dump typically contains millions of triples. Downloading such a large fragment can be undesired in certain situations, for instance, if we just want to inspect part of the data in the fragment, or if we are only interested in a fragment’s metadata but not its actual data. Therefore, a server that hosts ldfs can segment them into smaller pages. Formally, we capture such a page as follows: Definition 5 (ldf page). Let f = hu, s, Γ, M, Ci be an ldf of some finite set of blank-node-free rdf triples. A page partitioning of f is a finite, nonempty set Φ consisting of so-called pages of f such that the following properties hold: 1. Each page φ ∈ Φ is a tuple φ = hu′, uf , sf , Γ ′, M ′, C ′ i with the following six elements: (i) u′ is a uri from which page φ can be retrieved with u′ = 6 u, (ii) uf = u, (iii) sf = s, (iv) Γ ′ ⊆ Γ , (v) M ′ ⊇ M , and (vi) C ′ ⊇ C. 2. For every pair of two distinct pages φi = hu′i , uf , sf , Γi′ , Mi′ , Ci′ i ∈ Φ and φj = hu′j , uf , sf , Γj′ , Mj′ , Cj′ i ∈ Φ it holds that u′i 6= u′j and Γi′ ∩ Γj′ = ∅. S 3. Γ = hu′,uf ,sf ,Γ ′,M ′,C ′ i∈Φ Γ ′ . 4. There exists a strict total order ≺ on Φ such that, for every pair of two pages φi = hu′i , uf , sf , Γi′ , Mi′ , Ci′ i ∈ Φ and φj = hu′j , uf , sf , Γj′ , Mj′ , Cj′ i ∈ Φ with φj being the direct successor of φi (i.e., φi ≺ φj and ¬∃φk ∈ Φ : φi ≺ φk ≺ φj ), there exists a control c ∈ Ci′ with u′j ∈ img(c). Note in particular that each page contains all metadata and controls of the corresponding fragment, in addition to the controls that allow to navigate from one page to the next. If paging is available, servers should automatically redirect clients from the fragment to its first page, to avoid sending overly large chunks. Pre-print of a paper accepted to the International Semantic Web Conference 2014 (ISWC 2014). The final publication is available at link.springer.com. Querying Datasets on the Web with High Availability 7 The collection of all ldfs of a certain dataset provided by a server is captured formally as follows: Definition 6 (ldf collection). Let G ⊆ T ∗ be a finite set of blank-node-free rdf triples, and let c be a control. The c-specific ldf collection over G is a set F of ldfs such that, for each ldf f ∈ F with f = hu, s, Γ, M, Ci, the following three properties hold: 1. f is an ldf of G; 2. s ∈ dom(c); 3. c(s) = u. Finally, we define a query semantics for evaluating sparql queries over a dataset that is published as a collection of ldfs. Definition 7 (query semantics). Let G ⊆ T ∗ be a finite set of blank-nodefree rdf triples, and let F be some ldf collection over G. The evaluation of a sparql graph pattern P over F , denoted by [[P ]]F , is defined by [[P ]]F = [[P ]]G . 3.3 Triple Pattern Fragments The current http interfaces for rdf, as discussed in Section 2 and summarized in Fig. 1, have limitations for query evaluation over live data with high availability. To facilitate querying on the client side, clients should be able to access those fragments that correspond to important parts of the query. To maximize availability on the server side, servers should only offer those fragments they can generate with minimal effort. In other words, we have to search for a compromise along the axis in Fig. 1. Offering triple-pattern-based access to datasets seems an interesting compromise because a) graph patterns, the main building blocks for sparql queries, consist of triple patterns, so they are important query parts for clients; and b) servers can select triples corresponding to a certain triple pattern at low processing cost [7]. For this reason, we introduced a triple-pattern-based http interface for data access [26, 27], which we formalize as follows. Definition 8 (triple pattern fragment and collection). Given a control c, a c-specific ldf collection F is called a triple pattern fragment collection if, for any possible triple pattern tp, there exists an ldf hu, s, Γ, M, Ci ∈ F , referred to as a triple pattern fragment, such that the following three properties hold: 1. s is the triple pattern selector for triple pattern tp (as per Definition 2). 2. There exists a (metadata) rdf triple hu, void : triples, cnti ∈ M with cnt representing an estimate of the cardinality of Γ , that is, cnt is an integer that has the following two properties: (a) If [[tp]]G = ∅, then cnt = 0.  6 ∅, then cnt > 0 and abs |[[tp]]G | − cnt ≤ ǫ for some F -specific (b) If [[tp]]G = threshold ǫ. 3. c ∈ C. Since the selector s of a triple pattern fragment f = hu, s, Γ, M, Ci is a triple pattern selector, all elements of Γ are singleton sets: |G′ | = 1 for all G′ ∈ Γ . Large fragments would usually be paged as in Definition 5; so while a single page would not contain all matching triples of the fragment, it would contain the cnt estimate metadata for the entire fragment, together with the collection’s control. Pre-print of a paper accepted to the International Semantic Web Conference 2014 (ISWC 2014). The final publication is available at link.springer.com. 8 Ruben Verborgh et al. Furthermore, any triple pattern fragment collection over some set of rdf triples G consists of the complete set of triple pattern fragments of G, which in practice means the server can provide any of them when requested, i.e., it does not need to have materialized versions for all of them. Each of these fragments includes the collection-specific hypermedia control (e.g., using the Hydra Core Vocabulary [26]), making triple pattern fragment collections hypermedia-driven rest apis [9]. Consequently, by discovering an arbitrary fragment of a collection, a client can directly reach and retrieve all fragments of the collection. In particular, this includes all fragments with a selector for one of the triple patterns of a given sparql graph pattern. Therefore, clients can compute a complete query result for such a pattern over the collection after obtaining any of its fragments. In the following section, we discuss an efficient approach for performing this. 4 4.1 sparql Queries over Triple Pattern Fragments High-level algorithm Triple pattern fragments offer triple-pattern-based access to a dataset on the Web. If a client wants to evaluate a sparql query over this dataset, it should thus transform this query into a sequence of triple pattern queries. To optimize the performance of the execution, the number of http requests should be minimized, and they should execute in parallel to the extent possible. Reducing the number of expensive operations is possible by selecting a suitable order in which query parts are evaluated. Therefore, database systems use a query planner to create an optimized order, based on statistical information about the data [10]. Since such information is usually not available for data on the Web, query planners have to resort to heuristics [13]. To mitigate this, triple pattern fragments contain metadata, i.e., the number of triples matching a certain pattern. We previously introduced a recursive algorithm to efficiently evaluate basic graph patterns (bgps) over a triple pattern collection [27], since bgps form the main building blocks of sparql queries. We summarize the algorithm here: 1. For each triple pattern tpi in the bgp B = {tp1 , . . . , tpn }, fetch the first page φi1 of the ldf fi for tpi , which contains an estimate cnti of the total number of matches for tpi . Choose ǫ such that cntǫ = min({cnt1 , . . . , cntn }). 2. Fetch all remaining pages of fǫ . For each triple t in the ldf, generate the solution mapping µt such that µt [tpǫ ] = t. Then compose the subpattern Bt = {tp | tp = µt [tpj ] ∧ tpj ∈ B} \ {t}. If Bt = 6 ∅, find mappings ΩBt by calling the algorithm for Bt . Else, ΩBt = {µ∅ } with µ∅ the empty mapping. 3. Return all solution mappings µ ∈ {µt ∪ µ′ | µ′ ∈ ΩBt }. By recursively fetching those fragments with the lowest number of matches, and applying their mappings to the graph pattern, we narrow down the number of http requests that are subsequently needed. While this algorithm finds all matches for the bgp in the collection, its recursive calling structure returns all results at once, i.e., we have to wait for the first result until all other results have been computed. Furthermore, adding support for additional sparql operators to such a monolithic algorithm is impractical. Pre-print of a paper accepted to the International Semantic Web Conference 2014 (ISWC 2014). The final publication is available at link.springer.com. Querying Datasets on the Web with High Availability 4.2 9 Dynamic iterator pipelines A common approach to implement query execution in database systems is through iterators that are typically arranged in a tree or a pipeline, based on which query results are computed recursively [10]. Such a pipelined approach has also been studied for Linked Data query processing [13, 15]. In order to enable incremental results and allow the straightforward addition of sparql operators, we implement a triple pattern fragments client using iterators. The previous algorithm, however, cannot be implemented by a static iterator pipeline. For instance, consider a query for architects born in European capitals: SELECT ?person ?city WHERE { ?person a dbpedia-owl:Architect. # ?person dbpprop:birthPlace ?city. # ?city dc:subject dbpedia:Capitals_in_Europe. # tp1 tp2 tp3 } LIMIT 100 Suppose the pipeline begins by finding ?city mappings for tp3 . It then needs to choose whether it will next consider tp1 or tp2 . The optimal choice, however, differs depending on the value of ?city: – For dbpedia:Paris, there are ±1,900 matches for tp2 , and ±1,200 matches for tp1 , so there will be less http requests if we continue with tp1 . – For dbpedia:Vilnius, there are 164 matches for tp2 , and ±1,200 matches for tp1 , so there will be less http requests if we continue with tp2 . With a static pipeline, we would have to choose the pipeline structure in advance and subsequently reuse it. In order to generate an optimized pipeline for each (sub-)query, we propose a divide-and-conquer strategy in which a query is decomposed dynamically into subqueries depending on partial solution mappings. The main function of an iterator is next(), which either returns a mapping or nil if no mappings are left. We first introduce a trivial start iterator, which outputs the empty mapping µ0 on the first call to next(), and nil on all subsequent calls. Next, we implement a previously defined triple pattern iterator [15] for triple pattern fragments. This iterator Itp is initialized with a predecessor iterator Ip , a triple pattern tp, and a page φ0 of an arbitrary triple pattern fragment of a collection F . The iterator then extends mappings from its predecessor by reading triples from the ldf corresponding to triple pattern tp. The url of this ldf is retrieved through the collection control in the start page φ0 . Each call to Itp .next() results in mappings for tp in F , depending on the predecessor’s mappings. To solve bgps of sparql queries, we introduce a triple pattern fragment bgp iterator. Such a bgp iterator is initialized with a predecessor Ip , a bgp B = {tp1 , . . . , tpn }, and an arbitrary triple pattern fragment page φ0 of a collection F . For an empty pattern (n = 0), a bgp iterator is equal to a start iterator. For a pattern length n = 1, it is constructed by creating a triple pattern iterator for (Ip , tp1 , φ0 ). For n ≥ 2, a bgp iterator uses Algorithm 1. bgp iterators evaluate a bgp by recursively decomposing it into smaller iterators. For each triple pattern in the bgp mapped by each result of Ip , the iterator Pre-print of a paper accepted to the International Semantic Web Conference 2014 (ISWC 2014). The final publication is available at link.springer.com. 10 1 2 3 4 5 6 7 8 9 10 11 12 13 Ruben Verborgh et al. Data: (predecessor Ip , bgp B = {tp1 , . . . , tpn } with n ≥ 2, start page φ0 ) I ← nil; c ← the triple pattern control in the control set C0 of φ0 ; Function BasicGraphPatternIterator.next() µ ← nil; while µ = nil do while I = nil do µp ← Ip .next(); return nil if µp = nil; Φ ← {φi1 | φi1 = http GET first fragment page using url c(µp [tpi ])}; ǫ ← i such that cntφi1 = min({cntφ11 , . . . , cntφn1 }); Iǫ ← TriplePatternIterator(StartIterator(), µp [tpǫ ], φǫ1 ); I ← BasicGraphPatternIterator(Iǫ , {µ[tp] | tp ∈ B \ {tpǫ }}, φǫ1 ); µ ← I .next(); return µ ∪ µp ; Algorithm 1: For all mappings µp of a predecessor Ip , a bgp iterator for a pattern B = {tp1 , . . . , tpn } creates a triple pattern iterator Iǫ for the least frequent pattern tpǫ , passed to a bgp iterator for the remainder of P . fetches the first page of the corresponding ldf. This page contains the cnt metadata, which tells us how many matches the dataset has for each triple pattern. The pattern is then decomposed by evaluating it using a) a triple pattern iterator for the triple pattern with the smallest number of matches, and b) a new bgp iterator for the remainder of the pattern. This results in a dynamic pipeline for each of the mappings of its predecessor, as visualized in Fig. 2. Each pipeline is optimized locally for a specific mapping, reducing the number of requests. To evaluate a sparql query over a triple pattern fragment collection, we proceed as follows. For each bgp of the query, a bgp iterator is created. Dedicated iterators are necessary for other sparql constructs such as UNION and OPTIONAL, but their implementation need not be ldf-specific; they can reuse the triple pattern fragment bgp iterators. The predecessor of the first iterator is a start iterator. We continuously pull solution mappings from the last iterator in the pipeline and output them as solutions of the query, until the last iterator responds with nil. This pull-based process is able to deliver results incrementally. Zagreb Budapest Rome ... Alen_Peternac Drago_Ibler Juraj_Neidhardt ... ?city subject Capitals_in_Europe. ... ?person birthPlace Zagreb. B ′′ = { Drago_Ibler a Architect. } ′ B = { ?person a Architect. ?person birthPlace Zagreb. } B = { ?person a Architect. ?person birthPlace ?city. ?city subject Capitals_in_Europe. } Fig. 2: A bgp iterator decomposes a bgp B = {tp1 , . . . , tpn } into a triple pattern iterator for an optimal tpi and, for each resulting solution mapping µ of tpi , creates a bgp iterator for the remaining pattern B ′ = {tp | tp = µ[tpj ] ∧ tpj ∈ B} \ {µ[tpi ]}. Pre-print of a paper accepted to the International Semantic Web Conference 2014 (ISWC 2014). The final publication is available at link.springer.com. Querying Datasets on the Web with High Availability 11 As most time of an ldf client is spent waiting on http requests, the process can be sped up by buffering the individual iterators. A major advantage of our dynamic pipelines is that, because each element of a bgp iterator uses its own separate sub-pipeline, multiple pipelines can run in parallel. E.g., given the context of Fig. 2, the pipelines for Zagreb, Budapest, and Rome can run in parallel, and so can those for the Alen_Peternac, Drago_Ibler, and Juraj_Neidhardt. This results in more concurrent http requests and thus a lower average waiting time per request. Since triple pattern fragment apis are deliberately designed to allow high throughput, clients are not bound by crawler politeness rules [14]. 5 Evaluation The goal of the evaluation is to compare the availability–performance relationship of triple-pattern-based query execution to query execution over other ldfs, sparql endpoints in particular. Performance in this case refers to the query response time (i.e., time until the client reports a first solution of the query result) and total execution time. We measure availability of a server as the fraction of cases in which the client receives a response within a specified amount of time after sending a request. For this evaluation, we use a timeout of 60 seconds. Since overloaded servers are a major cause of unavailability, we also monitor processor, memory, and bandwidth usage of servers. The assumption is that servers with high resource usage will be more prone to low availability, i.e., a temporal inability to process responses in a reasonable time. 5.1 Experimental setup We implemented the triple pattern fragments query execution approach of Section 4 as an open-source ldf client for sparql queries. This client is written in JavaScript, so it can be used either as a standalone application, or as a library for browser and server applications. While we also implemented an ldf client as an adapter for the popular Java framework Jena [11], it was not included in the comparison, because it uses the existing Jena arq querying infrastructure instead of our algorithm. The used ldf server is an open-source Java server with the compressed hdt format [7] as back-end. We provide all source code of the implementations, as well as the full benchmark configuration, at https://github.com/LinkedDataFragments/. The triple pattern fragments client/server setup is compared to four sparql endpoint infrastructures: Virtuoso (6.1.8 and 7.1.1) [5] and Jena Fuseki [11] (tdb 1.0.1 and hdt 1.1.1). To measure the availability and performance of triple pattern fragment servers and sparql endpoints under varying loads, we set up an environment with one server and a variable number of clients. In order to obtain repeatable experiments, the benchmarks were executed on virtual machines on the Amazon aws platform. The complete setup consists of 1 server (4 virtual cpus, 7.5 gb ram), 1 http cache (8 virtual cpus, 15 gb ram) and 60 client machines (4 virtual cpus, 7.5 gb ram), capable of running 4 single-threaded clients each. We purposely chose a modest server machine to show the impact for low-budget scenarios. The http cache acts as a proxy server between servers and clients and Pre-print of a paper accepted to the International Semantic Web Conference 2014 (ISWC 2014). The final publication is available at link.springer.com. 12 Ruben Verborgh et al. was chosen for its bandwidth capabilities (which Amazon associates with specific cpu/ram combinations). It caches http requests for a maximum of 5 minutes. To date, no sparql availability benchmark exists; however, several performance benchmarks exist. We chose the Berlin sparql Benchmark (bsbm) [3] because of its wide-spread use, with a dataset size of 100 million triples. To mimic the variability of real-world scenarios, each client executes different bsbm query workloads based on its own random seed. As existing work on Linked Data querying focuses exclusively on bgp queries [14], this paper is the first to use a sparql benchmark on a Linked Data publishing method with a nonsparql http interface. We do not aim for best performance with triple pattern fragments; instead, we strive to improve the availability/performance balance. For our experiments, we have extended bsbm with support for parsing streaming Turtle results, and added the possibility to measure the response time (reception of first solution) in addition to the total query execution time (reception of all solutions). Some of the bsbm queries use the ORDER BY operator, which has to be implemented in a blocking way; i.e., the first solution can only be sent after all solutions have been computed. Therefore, (only) for measurements of the response time, we use variants of these queries without ORDER BY, assuming the user application prefers streaming results and performs sorting itself. After every 1-second interval during the evaluation, we measure on the server, cache, and client the current value of several properties, including cpu usage of each core, memory usage, and network io. These measurements are obtained using PerfMon, while distributed testing happens using JMeter.2 5.2 Results and discussion Figs. 3.1 to 3.10 summarize the main measurements of the evaluation. All x-axes use a logarithmic scale, because we varied the number of clients exponentially. Fig. 3.1 shows that the performance of sparql endpoints significantly decreases with the number of clients. Even though a triple pattern fragments setup executes sparql queries with lower performance, the performance decrease with a higher number of clients is significantly lower. Because of caching effects, triple pattern fragments querying starts performing slightly better with a high number of clients (n > 100). The per-core processor usage of the sparql endpoints grows rapidly (Fig. 3.5) and quickly reaches the maximum; in practice, this means the endpoint spends all cpu time processing queries while newly incoming requests are queued. The triple pattern fragments server consumes only limited cpu, because each individual request is simple to answer, and due to their finer granularity, the cache can answer several requests (Fig. 3.4). At the client side, the opposite happens (Fig. 3.7): clients of sparql endpoints hardly use cpu, whereas triple pattern fragments clients do use between 20% and 100% cpu. This percentage decreases with higher numbers of clients, because the networking time dominates. Memory consumption remains fairly constant and low (Fig. 3.8). On the server (Fig. 3.6), memory usage remains constantly high; however, each considered implementation could be configured to use less memory. 2 http://jmeter.apache.org/ and http://jmeter-plugins.org/wiki/PerfMonAgent/ Pre-print of a paper accepted to the International Semantic Web Conference 2014 (ISWC 2014). The final publication is available at link.springer.com. 100 1000 10000 Virtuoso 6 Virtuoso 7 Fuseki–tdb Fuseki–hdt triple pattern fragments 10 0 clients 10 100 1 10 100 Fig. 3.2: Server network traffic 200 20 150 100 10 50 0 0 1 clients 10 100 1 clients 10 100 Fig. 3.4: Cache network traffic Fig. 3.3: Query timeouts 8 6 4 2 0 ram use (gb) 1 0.8 0.6 0.4 0.2 0 ram use (gb) cpu use (%) clients 100 50 0 clients 1 10 100 1 clients 10 100 Fig. 3.6: Server memory usage Fig. 3.5: Server processor usage per core 100 50 0 clients 1 10 100 1 clients 10 100 Fig. 3.8: Client memory usage Fig. 3.7: Client processor usage per core 10 1 1 0.1 0.1 0.01 0.01 0 0 clients 10 100 Fig. 3.9: Query 3 execution time (log-log plot) 1 clients 10 avg. time (s) 10 1 sent (mb) #timeouts Fig. 3.1: Server performance (log-log plot) cpu use (%) 4 2 1 avg. time (s) 13 data sent (mb) throughput (q/hr) Querying Datasets on the Web with High Availability 100 Fig. 3.10: Query 3 response time (log-log plot) Fig. 3.2 shows the outbound network traffic on the server with an increasing number of clients. This traffic is substantially higher with triple pattern fragment servers, because clients need to ask for several responses to evaluate a single query. The cache ensures that responses to identical requests are reused; Fig. 3.4 indeed shows that caching is far more effective with triple pattern fragments. Some of the bsbm queries execute slowly on triple pattern fragments clients, especially those queries that strongly depend on operators such as FILTER, which in a triple-pattern-based interface can only be evaluated on the client. The execution times of these queries exceed the timeout limit of 60s (Fig. 3.3). Therefore, we separately study bsbm query template 3 (finding products that satisfy 2 numerical inequalities and an OPTIONAL clause), which is one of the templates whose queries cause few timeouts. Note how its execution time (Fig. 3.9) for triple pattern fragments starts high, but only increases very gradually, whereas the execution time on sparql endpoints rises very rapidly. Furthermore, the response times increase more slowly with increased load (Fig. 3.10). Only on the triple pattern fragments server, cpu usage remains low for this query at all times. Pre-print of a paper accepted to the International Semantic Web Conference 2014 (ISWC 2014). The final publication is available at link.springer.com. 14 Ruben Verborgh et al. These results indicate that triple pattern fragments query execution succeeds in reducing server usage, at the cost of increased query times. Triple pattern fragments servers cope better with increasing numbers of clients than sparql endpoints. Furthermore, querying benefits strongly from regular http caching, which can be added at any point in the network. This is all the more remarkable since, to allow comparisons with other work, these results were obtained with an existing sparql benchmark that focuses on performance, not availability. Even though certain queries make it difficult for an ldf client to find all results within the timeout window (especially with blocking operators such as ORDER BY), the first results to all queries arrive before the timeout period. In the future, the development of an availability-focused sparql benchmark could stimulate availability improvements of the considered systems. The full results of our experiments are published as ldfs at http://data.linkeddatafragments.org/. 6 Conclusions Publishers of Linked Data strive to host their data reliably at minimal cost. Applications, on the other hand, need to query data in the most flexible way. The three well-known rdf interfaces on the Web—sparql endpoints, Linked Data documents, and data dumps—are just a fraction of all possible ways to transfer Linked Data from a server to a client. Since sparql endpoints offer the most flexibility, they are not coincidentally the most expensive to host with high availability. The Linked Data Fragments framework captures the search for alternative http interfaces to rdf data, trying to balance the server’s desire for maximum reliability and the client’s need for maximum flexibility. In this paper, we have shown that triple pattern fragments, which additionally contain count metadata and hypermedia controls, can reduce the load on servers to less than 30% of the load on sparql endpoints. This happens by shifting the query-specific tasks to clients, at the cost of slower query execution. Instead of sending one complex query, clients use a dynamic iterator pipeline to combine the results of several simpler queries, thereby also vastly improving the effectiveness of http caching. This captures the spirit of Web querying: clients browse pages and iteratively extract bits of information to find complex answers. The goal of triple pattern fragments is to provide those bits that are helpful for clients to evaluate queries, yet inexpensive to generate by servers. Triple pattern fragments are definitely not the final answer to querying rdf datasets on the Web. In fact, there will probably never be such a final answer. By definition, each api on the Web that publishes rdf triples (which, through json-ld [24], can include those apis that publish json) offers its own kind of ldfs. The challenge for future clients is to find answers to queries through all kinds of different fragments along the axis of ldfs. The results indicate the potential of this querying strategy, as we have shown they allow executing complex queries of common sparql benchmarks over live data on the Web with high availability. Whereas the Linked Data principles emphasize hyperlinks between data documents [2], triple pattern fragments add forms that let clients control what data they request. Those forms allow custom access, but at the same time limit the possible kind of queries in order to save on server processing resources. Pre-print of a paper accepted to the International Semantic Web Conference 2014 (ISWC 2014). The final publication is available at link.springer.com. Querying Datasets on the Web with High Availability 15 Especially in cases where there are limited financial resources to publish data, triple pattern fragments could make a significant difference: data can be hosted at low cost, in a way that allows live querying, with high availability. In addition to our own open source implementations of ldf servers, two third-party implementations are available. The Belgian Crossroads Bank for Enterprises recently published their data as triple pattern fragments (http://data.kbodata.be/) using their own open-source server. The open-source data management system The DataTank (http://thedatatank.com/) now also supports triple pattern fragments. This lowers the entry barrier for publishers even further. Implementers of clients and servers can follow the triple pattern fragments specification [26]. Improving the performance is possible if clients can query more specific fragments. In particular, support for certain FILTER expressions would speed up several queries, as triple pattern fragments only allow for exact matches. Interesting future work is therefore to define new classes of ldfs that support such features, where we always need to keep in mind that minimizing the server’s processing cost for each fragment is the key to maximizing its availability. Part of this work includes the description of such fragments, so a client can dynamically discover what fragment types a server offers, and thus how it can execute a query in an optimal way. For instance, if a server supports (a subset of) sparql, clients need to ask fewer queries than when only triple patterns are supported. This trade-off between server cost/availability and client performance will continue to exist. Finally, ldf querying shows us that we perhaps need to re-evaluate the way we develop applications on top of Linked Data. The dominant paradigm so far has been: “ask a complex question to a server; wait; act on the results”, where the “waiting” part can be long if the server has low availability. As the response times of the evaluation indicate, new applications might prefer not to wait, evolving towards a real-time, distributed paradigm: “ask simple questions to many servers, acting on results as they arrive”. A major benefit of clients that solve queries by fetching fragments of Linked Data, in addition to incrementally updating results, is that they can support distributed querying by asking fragments from different servers. In other words, limiting the http interfaces of rdf servers does not only lead to higher availability, it encourages clients to solve complex queries themselves—and for that, they have the entire Web of Data at their disposal. References 1. Amundsen, M.: Hypermedia types. In: rest: From Research to Practice, pp. 93– 116. Springer (2011) 2. Bizer, C., Heath, T., Berners-Lee, T.: Linked Data – the story so far. International Journal on Semantic Web and Information Systems 5(3), 1–22 (Mar 2009) 3. Bizer, C., Schultz, A.: The Berlin sparql benchmark. International Journal on Semantic Web and Information Systems 5(2), 1–24 (2009) 4. Buil-Aranda, C., Hogan, A., Umbrich, J., Vandenbussche, P.Y.: sparql Webquerying infrastructure: Ready for action? In: Proceedings of the 12th International Semantic Web Conference (Nov 2013) 5. Erling, O., Mikhailov, I.: Virtuoso: rdf support in a native rdbms. In: Semantic Web Information Management, pp. 501–519. Springer (2010) Pre-print of a paper accepted to the International Semantic Web Conference 2014 (ISWC 2014). The final publication is available at link.springer.com. 16 Ruben Verborgh et al. 6. Feigenbaum, L., Williams, G.T., Clark, K.G., Torres, E.: sparql . protocol. Recommendation, wc (Mar 2013), http://www.w3.org/TR/sparql11-protocol/ 7. Fernández, J.D., Martínez-Prieto, M.A., Gutiérrez, C., Polleres, A., Arias, M.: Binary rdf representation for publication and exchange (hdt). Journal of Web Semantics 19, 22–41 (Mar 2013) 8. Fielding, R.T.: Architectural Styles and the Design of Network-based Software Architectures. Ph.D. thesis, University of California (2000) 9. Fielding, R.T.: rest apis must be hypertext-driven (Oct 2008), http://roy.gbiv. com/untangled/2008/rest-apis-must-be-hypertext-driven 10. Graefe, G.: Query evaluation techniques for large databases. acm Computing Surveys 25(2), 73–169 (Jun 1993) 11. Grobe, M.: rdf, Jena, sparql and the Semantic Web. In: Proceedings of the 37th Annual acm siguccs Fall Conference: Communication and Collaboration (2009) 12. Harris, S., Seaborne, A.: sparql . query language. Recommendation, wc (Mar 2013), http://www.w3.org/TR/sparql11-query/ 13. Hartig, O.: Zero-knowledge query planning for an iterator implementation of link traversal based query execution. In: Proceedings of the 8th Extended Semantic Web Conference. pp. 154–169. Springer (2011) 14. Hartig, O.: An overview on execution strategies for Linked Data queries. Datenbank-Spektrum 13(2), 89–99 (2013) 15. Hartig, O., Bizer, C., Freytag, J.C.: Executing sparql queries over the Web of Linked Data. In: Proceedings of the 8th International Semantic Web Conference. pp. 293–309. Springer (2009) 16. Klyne, G., Carrol, J.J.: Resource Description Framework (rdf): Concepts and abstract syntax. Rec., wc (Feb 2004), http://www.w3.org/TR/rdf-concepts/ 17. Linked Data api, https://code.google.com/p/linked-data-api/ 18. Morsey, M., Lehmann, J., Auer, S., Ngonga Ngomo, A.C.: dbpedia sparql benchmark – performance assessment with real queries on real data. In: Proceedings of the 9th International Semantic Web Conference (2011) 19. Ogbuji, C.: sparql 1.1 graph store http protocol. Recommendation, wc (Mar 2013), http://www.w3.org/TR/sparql11-http-rdf-update/ 20. Pérez, J., Arenas, M., Gutierrez, C.: Semantics and complexity of sparql. acm Transactions on Database Systems 34(3), 16:1–16:45 (Sep 2009) 21. Restpark, http://lmatteis.github.io/restpark/ 22. Schmidt, M., Hornung, T., Meier, M., Pinkel, C., Lausen, G.: SP2 Bench: A sparql performance benchmark. In: Semantic Web Information Management (2010) 23. Speicher, S., Arwe, J., Malhotra, A.: Linked Data Platform 1.0. Candidate recommendation, wc (Jun 2014), http://www.w3.org/TR/2014/CR-ldp-20140619/ 24. Sporny, M., Longley, D., Kellogg, G., Lanthaler, M., Lindström, N.: json-ld 1.0 – a json-based serialization for Linked Data. Recommendation, wc (Jan 2014), http://www.w3.org/TR/json-ld/ 25. Verborgh, R.: Linked Data Fragments. Unofficial draft, Hydra wc Community Group, http://www.hydra-cg.com/spec/latest/linked-data-fragments/ 26. Verborgh, R.: Triple Pattern Fragments. Unofficial draft, Hydra wc Community Group, http://www.hydra-cg.com/spec/latest/triple-pattern-fragments/ 27. Verborgh, R., Vander Sande, M., Colpaert, P., Coppens, S., Mannens, E., Van de Walle, R.: Web-scale querying through Linked Data Fragments. In: Proceedings of the 7th Workshop on Linked Data on the Web (Apr 2014) 28. Wilde, E., Hausenblas, M.: restful sparql? You name it! – Aligning sparql with rest and resource orientation. In: Proceedings of the 4th Workshop on Emerging Web Services Technology. pp. 39–43. acm (2009) Pre-print of a paper accepted to the International Semantic Web Conference 2014 (ISWC 2014). The final publication is available at link.springer.com.