From 26.03.06 to 29.03.06, the Dagstuhl Seminar 06131 "Peer-to-Peer-Systems and -Applications" was held in the International Conference and Research Center (IBFI), Schloss Dagstuhl. During the seminar, several participants presented their current research, and ongoing work and open problems were discussed. Abstracts of the presentations given during the seminar as well as abstracts of seminar results and ideas are put together in this paper. The first section describes the seminar topics and goals in general. Links to extended abstracts or full papers are provided, if available.
In pursuit of graph processing performance, the systems community has largely abandoned general-purpose distributed dataflow frameworks in favor of specialized graph processing systems that provide tailored programming abstractions and accelerate the execution of iterative graph algorithms. In this paper we argue that many of the advantages of specialized graph processing systems can be recovered in a modern general-purpose distributed dataflow system. We introduce GraphX, an embedded graph processing framework built on top of Apache Spark, a widely used distributed dataflow system. GraphX presents a familiar composable graph abstraction that is sufficient to express existing graph APIs, yet can be implemented using only a few basic dataflow operators (e.g., join, map, group-by). To achieve performance parity with specialized graph systems, GraphX recasts graph-specific optimizations as distributed join optimizations and materialized view maintenance. By leveraging advances in distr...
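The claim that graph APIs can be expressed with only basic dataflow operators can be illustrated with a small sketch. The code below is not GraphX's Scala API; it is an assumed in-memory analogue showing a graph computation (out-degree per vertex) built from map, group-by, and join style steps over plain Python collections.

```python
# Illustrative sketch (not GraphX's actual API): out-degree per vertex
# expressed with dataflow-style operators over plain collections.
from collections import defaultdict

edges = [("a", "b"), ("a", "c"), ("b", "c")]               # edge table (src, dst)
vertices = [("a", "Alice"), ("b", "Bob"), ("c", "Carol")]  # vertex table (id, attr)

# group-by: count outgoing edges per source vertex
degrees = defaultdict(int)
for src, _dst in edges:          # map each edge to its source, then aggregate
    degrees[src] += 1

# join: attach the computed degree back to the vertex table, mimicking the
# kind of vertex/edge join GraphX implements as a distributed join
vertex_degrees = [(vid, attr, degrees.get(vid, 0)) for vid, attr in vertices]
print(vertex_degrees)   # [('a', 'Alice', 2), ('b', 'Bob', 1), ('c', 'Carol', 0)]
```

In the real system these operators run as distributed Spark jobs, and recurring joins are maintained incrementally as materialized views.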
Proceedings IEEE 24th Annual Joint Conference of the IEEE Computer and Communications Societies.
Existing solutions to achieve load balancing in DHTs incur a high overhead either in terms of routing state or in terms of load movement generated by nodes arriving or departing the system. In this paper, we propose a set of general techniques and use them to develop a protocol based on Chord, called Y0, that achieves load balancing with minimal overhead under the typical assumption that the load is uniformly distributed in the identifier space. In particular, we prove that Y0 can achieve near-optimal load balancing, while moving little load to maintain the balance, and increasing the size of the routing tables by at most a constant factor. Using extensive simulations based on real-world and synthetic capacity distributions, we show that Y0 reduces the load imbalance of Chord from O(log n) to a factor of less than 4 without increasing the number of links that a node needs to maintain. In addition, we study the effect of heterogeneity on both DHTs, demonstrating significantly reduced average route length as node capacities become increasingly heterogeneous. For a real-world distribution of node capacities, the route length in Y0 is asymptotically less than half the route length in the case of a homogeneous system.
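The O(log n) imbalance of plain consistent hashing, and the effect of giving each node several positions on the identifier ring, can be seen in a short experiment. This is a generic virtual-ID sketch under assumed parameters, not the Y0 construction itself, which chooses IDs more carefully to bound routing state.

```python
# Minimal sketch of consistent hashing with k virtual IDs per node.
# Illustrates the general virtual-ID idea for DHT load balancing; it is
# not the Y0 protocol, only a baseline comparison under assumed settings.
import hashlib
from bisect import bisect_left

def h(s: str) -> int:
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % (2 ** 32)

def build_ring(nodes, k):
    # each node owns k points ("virtual IDs") on the identifier ring
    return sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(k))

def owner(ring, points, key):
    i = bisect_left(points, h(key)) % len(points)   # first point clockwise
    return ring[i][1]

def imbalance(nodes, k, num_keys=20000):
    ring = build_ring(nodes, k)
    points = [p for p, _ in ring]
    load = {n: 0 for n in nodes}
    for i in range(num_keys):
        load[owner(ring, points, f"key{i}")] += 1
    return max(load.values()) / (num_keys / len(nodes))  # max load vs. average

nodes = [f"node{i}" for i in range(64)]
print("1 virtual ID per node :", round(imbalance(nodes, 1), 2))
print("32 virtual IDs per node:", round(imbalance(nodes, 32), 2))
```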
Most P2P systems that provide a DHT abstraction distribute objects randomly among "peer nodes" in a way that results in some nodes having Θ(log N) times as many objects as the average node. Further imbalance may result due to nonuniform distribution of objects in the identifier space and a high degree of heterogeneity in object loads and node capacities. Additionally, a node's load may vary greatly over time since the system can be expected to experience continuous insertions and deletions of objects, skewed object arrival patterns, and continuous arrival and departure of nodes. In this paper, we propose an algorithm for load balancing in such heterogeneous, dynamic P2P systems. Our simulation results show that in the face of rapid arrivals and departures of objects of widely varying load, our algorithm achieves load balancing for system utilizations as high as 90% while moving only about 8% of the load that arrives into the system. Similarly, in a dynamic system where nodes arrive and depart, our algorithm moves less than 60% of the load the underlying DHT moves due to node arrivals and departures. Finally, we show that our distributed algorithm performs only negligibly worse than a similar centralized algorithm, and that node heterogeneity helps, not hurts, the scalability of our algorithm.
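One way to picture the kind of algorithm described here is a simple threshold scheme that moves objects from nodes above a target utilization to nodes below it. The paper's actual scheme (directories, emergency thresholds, and so on) is more involved; the function names, data layout, and 0.9 target below are illustrative assumptions only.

```python
# Hedged sketch of threshold-based load balancing in a heterogeneous system:
# nodes above the target utilization shed their largest objects to the
# least-utilized node that can still absorb them.

def rebalance(nodes, target_util=0.9):
    """nodes: dict name -> {"capacity": float, "objects": {obj_id: load}}"""
    moved = 0.0

    def util(n):
        return sum(nodes[n]["objects"].values()) / nodes[n]["capacity"]

    for n in [x for x in nodes if util(x) > target_util]:
        # shed largest objects first until this node drops below the target
        for obj, load in sorted(nodes[n]["objects"].items(), key=lambda kv: -kv[1]):
            if util(n) <= target_util:
                break
            dest = min(nodes, key=util)                  # least-utilized node
            if util(dest) + load / nodes[dest]["capacity"] > target_util:
                continue                                 # nobody can take it safely
            nodes[dest]["objects"][obj] = nodes[n]["objects"].pop(obj)
            moved += load
    return moved

nodes = {
    "a": {"capacity": 10.0, "objects": {"o1": 6.0, "o2": 5.0}},
    "b": {"capacity": 20.0, "objects": {"o3": 2.0}},
}
print("load moved:", rebalance(nodes))
```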
Proceedings Eighth Workshop on Hot Topics in Operating Systems
Early type-safe operating systems were hampered by poor performance. Contrary to these experiences, we show that an operating system founded on an object-oriented, type-safe intermediate code can compete with MMU-based microkernels in terms of performance while widening the realm of possibilities. Moving from hardware-based protection to software-based protection offers new options for operating system quality, flexibility, and versatility that are superior to traditional process models based on MMU protection. However, using a type-safe language such as Java alone is not sufficient to achieve an improvement. While other Java operating systems adopted a traditional process concept, JX implements fine-grained protection boundaries. The JX system architecture consists of a set of Java components executing on the JX core, which is responsible for system initialization, CPU context switching, and low-level domain management. The Java code is organized in components which are loaded into domains, verified, and translated to native code. JX runs on commodity PC hardware, supports network communication and a frame grabber device, and contains an Ext2-compatible file system. Without extensive optimization this file system already reaches a throughput of 50% of Linux.
The paper is structured as follows: In Section 2 we describe the architecture of the JX system. The problems that appear when untrusted modules directly access hardware are discussed in Section 3. Section 4 gives examples of the performance of IPC and file system access.
2 JX System Architecture. The JX system consists of a small core, written in C and assembler, which is less than 100 kilobytes in size. The majority of the system is written in Java and runs in separate protection domains. The core runs without any protection and therefore must be trusted. It contains functionality that cannot be provided at the Java level (system initialization after boot-up, saving and restoring CPU state, low-level domain management, monitoring). The Java code is organized in components (Sec. 2.2) which are loaded into domains (Sec. 2.1), verified (Sec. 2.4), and translated to native code (Sec. 2.5). A domain can communicate with another domain by using portals (Sec. 2.3). The protection of the architecture is based solely upon the JX core, the code verifier, the code translator, and hardware-dependent components (Sec. 3). These elements are the trusted computing base [7] of our architecture.
2.1 Domains. A domain is the unit of protection, resource management, and typing.
Protection. Components in one domain trust each other. One of our aims is code reusability between different system configurations. A component should be able to run in a separate domain, but also together (co-located) with other components in one domain. This leads to several problems:
• The parameter passing semantics must be by-copy in inter-domain calls, but may be by-reference in the co-located case. This is an open problem.
• During a portal call a component must check the validity of the parameters because the caller could be in a different domain and is not trusted. When caller and callee are co-located (intra-domain call), the checks change their motivation: they are no longer done for security reasons but for robustness reasons. We currently parametrize the component to indicate whether a safety check should be performed.
Resource Management.
JX domains have their own heap and their own memory area for stacks, code, etc. If a domain needs memory, a domain-specific policy decides whether this request is allowed and how it may be satisfied, i.e., where the memory comes from. Objects are not shared between domains, but it is possible to share memory. Other Java systems use shared objects, with the consequence of complicated garbage collection that is not independent across domains, problems during domain termination, and quality-of-service crosstalk [13] between garbage collectors.
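The by-copy versus by-reference distinction for portal calls described above can be sketched in a language-neutral way. This is illustrative Python pseudocode under assumed class and method names, not JX's Java mechanism: a proxy deep-copies parameters when caller and callee live in different domains and passes references when they are co-located.

```python
# Illustrative sketch (not JX code): a portal proxy with by-copy semantics
# across protection domains and by-reference semantics when co-located.
import copy

class Portal:
    def __init__(self, target, caller_domain, callee_domain, check_args=True):
        self.target = target
        self.same_domain = caller_domain == callee_domain
        self.check_args = check_args      # JX parametrizes whether checks run

    def call(self, method, *args):
        if not self.same_domain:
            args = copy.deepcopy(args)    # by-copy across domain boundaries
        if self.check_args:
            for a in args:                # security check across domains,
                assert a is not None      # robustness check when co-located
        return getattr(self.target, method)(*args)

class FileService:
    def read(self, path):
        return f"<contents of {path}>"

portal = Portal(FileService(), caller_domain="domA", callee_domain="domB")
print(portal.call("read", "/etc/motd"))
```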
In this paper, we address the problem of designing a scalable, accurate query processor for peer-to-peer filesharing and similar distributed keyword search systems. Using a globally-distributed monitoring infrastructure, we perform an extensive study of the Gnutella filesharing network, characterizing its topology, data, and query workloads. We observe that Gnutella's query processing approach performs well for popular content, but quite poorly for rare items with few replicas. We then consider an alternative approach based on Distributed Hash Tables (DHTs). We describe our implementation of PIERSearch, a DHT-based system, and propose a hybrid system where Gnutella is used to locate popular items, and PIERSearch is used for handling rare items. We develop an analytical model of the two approaches, and use it in concert with our Gnutella traces to study the tradeoff between query recall and system overhead of the hybrid system. We evaluate a variety of localized schemes for identifying items that are rare and worth handling via the DHT. Lastly, in a live deployment on fifty nodes across two continents, we show that the hybrid system nicely complements Gnutella in its ability to handle rare items.
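The hybrid design can be pictured as a front end that floods the unstructured network first and falls back to the DHT when too few results come back, which is in the spirit of the localized rare-item heuristics the paper evaluates. The threshold, function names, and stub indexes below are assumptions for illustration, not PIERSearch itself.

```python
# Hedged sketch of a hybrid keyword-search front end: try flooding first,
# treat queries with few hits as "rare" and re-issue them over a DHT index.
# flood_search and dht_search are stubs standing in for Gnutella-style
# flooding and a PIERSearch-like inverted index.

RARE_THRESHOLD = 5   # fewer hits than this => treat the item as rare (assumed)

def flood_search(keywords):
    # stub: in Gnutella this would flood the query to neighbors with a TTL
    popular = {"ubuntu iso": ["peerA", "peerB", "peerC", "peerD", "peerE", "peerF"]}
    return popular.get(" ".join(keywords), [])

def dht_search(keywords):
    # stub: the hybrid system would intersect DHT-stored posting lists here
    rare = {"obscure dataset v2": ["peerX"]}
    return rare.get(" ".join(keywords), [])

def hybrid_search(keywords):
    hits = flood_search(keywords)
    if len(hits) >= RARE_THRESHOLD:
        return hits                      # popular: flooding already found it
    return hits + dht_search(keywords)   # rare: fall back to the DHT index

print(hybrid_search(["ubuntu", "iso"]))
print(hybrid_search(["obscure", "dataset", "v2"]))
```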
2007 6th International Symposium on Information Processing in Sensor Networks, 2007
Assisted in creating, debugging, and maintaining several features in the Oracle Enterprise Planning and Budgeting (EPB) software suite. Worked on multiple features in the product, including the setup of PL/SQL packages and schemas, as well as the interfaces to the Java-based UI (built using Oracle's OAFramework API).
Proceedings of the 2004 ACM SIGMOD international conference on Management of data, 2004
We are developing a distributed query processor called PIER, which is designed to run on the scale of the entire Internet. PIER utilizes a Distributed Hash Table (DHT) as its communication substrate in order to achieve scalability, reliability, decentralized control, and load balancing. PIER enhances DHTs with declarative and algebraic query interfaces, and underneath those interfaces implements multihop, in-network versions of joins, aggregation, recursion, and query/result dissemination. PIER is currently being used for diverse applications, including network monitoring, keyword-based filesharing search, and network topology mapping. We will demonstrate PIER's functionality by showing system monitoring queries running on PlanetLab, a testbed of over 300 machines distributed across the globe.
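A core technique behind in-network joins of this kind is rehashing tuples by the join key so that matching tuples meet at the same DHT node. The sketch below simulates that with an in-memory "DHT" of buckets; relation names, tuples, and the dict-based DHT are assumptions for illustration, not PIER's implementation.

```python
# Illustrative sketch of a DHT-based symmetric hash join: each relation's
# tuples are published under a hash of the join key, and matches are
# produced locally at the "node" responsible for that key.
from collections import defaultdict

dht = defaultdict(lambda: {"R": [], "S": []})   # key -> per-relation buckets

def publish(relation, join_key, tuple_):
    dht[hash(join_key)][relation].append(tuple_)

def probe_matches():
    for bucket in dht.values():                  # each "node" joins locally
        for r in bucket["R"]:
            for s in bucket["S"]:
                yield r + s

# R(flow_id, src_host) joined with S(flow_id, bytes) on flow_id
for t in [(1, "10.0.0.1"), (2, "10.0.0.2")]:
    publish("R", t[0], t)
for t in [(1, 5000), (3, 120)]:
    publish("S", t[0], t)

print(list(probe_matches()))    # [(1, '10.0.0.1', 1, 5000)]
```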
In this paper, we describe an ongoing effort to define common APIs for structured peer-to-peer overlays and the key abstractions that can be built on them. In doing so, we hope to facilitate independent innovation in overlay protocols, services, and applications, to allow direct experimental comparisons, and to encourage application development by third parties. We provide a snapshot of our efforts and discuss open problems in an effort to solicit feedback from the research community.
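The common-API effort centers on a small key-based routing (KBR) layer, roughly a route operation plus forward/deliver upcalls and neighbor/replica queries, on which DHTs, multicast, and other services can be layered. The Python interface below is only a paraphrase of that idea under assumed method names, not the normative API from the paper.

```python
# Paraphrased sketch of a key-based-routing (KBR) style interface of the
# kind the common-API effort describes; names and signatures are assumed.
from abc import ABC, abstractmethod

class KeyBasedRouting(ABC):
    @abstractmethod
    def route(self, key: int, msg: bytes, hint=None):
        """Forward msg toward the overlay node currently responsible for key."""

    @abstractmethod
    def forward(self, key: int, msg: bytes, next_hop):
        """Upcall at each intermediate overlay hop; applications may modify
        or redirect the message here."""

    @abstractmethod
    def deliver(self, key: int, msg: bytes):
        """Upcall at the node that is the root for key."""

    @abstractmethod
    def replica_set(self, key: int, max_rank: int):
        """Return nodes suitable for replicating data stored under key."""

# A DHT 'put' could be layered on top: route the value to the key's root,
# store it in deliver(), and mirror it onto replica_set(key, k).
```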
The Internet's core routing infrastructure, while arguably robust and efficient, has proven to be difficult to evolve to accommodate the needs of new applications. Prior research on this problem has included new hard-coded routing protocols on the one hand, and fully extensible Active Networks on the other. In this paper, we explore a new point in this design space that aims to strike a better balance between the extensibility and robustness of a routing infrastructure. The basic idea of our solution, which we call declarative routing, is to express routing protocols using a database query language. We show that our query language is a natural fit for routing, and can express a variety of well-known routing protocols in a compact and clean fashion. We discuss the security of our proposal in terms of its computational expressive power and language design. Via simulation, and deployment on PlanetLab, we demonstrate that our system imposes no fundamental limits relative to traditi...
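The flavor of "routing as a recursive query" can be conveyed by writing reachability rules and evaluating them to a fixpoint. The rules appear as comments and the evaluator is a deliberately naive single-process sketch, not the paper's query engine; the link table is made up.

```python
# Declarative-routing flavor, as a sketch. The two rules
#   path(S, D) :- link(S, D).
#   path(S, D) :- link(S, Z), path(Z, D).
# are evaluated to a fixpoint over a link table. A real declarative-routing
# engine compiles such rules into a distributed dataflow.

link = {("a", "b"), ("b", "c"), ("c", "d")}

def reachability(link):
    path = set(link)                       # rule 1: every link is a path
    changed = True
    while changed:                         # iterate rule 2 to a fixpoint
        changed = False
        new = {(s, d) for (s, z) in link for (z2, d) in path if z == z2}
        if not new <= path:
            path |= new
            changed = True
    return path

print(sorted(reachability(link)))
```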
Large-scale distributed systems are hard to deploy, and distributed hash tables (DHTs) are no exception. To lower the barriers facing DHT-based applications, we have created a public DHT service called OpenDHT. Designing a DHT that can be widely shared, both among mutually untrusting clients and among a variety of applications, poses two distinct challenges. First, there must be adequate control over storage allocation so that greedy or malicious clients do not use more than their fair share. Second, the interface to the DHT should make it easy to write simple clients, yet be sufficiently general to meet a broad spectrum of application requirements. In this paper we describe our solutions to these design challenges. We also report our early deployment experience with OpenDHT and describe the variety of applications already using the system.
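A shared put/get service of this kind typically bounds how long any client can hold storage by attaching a time-to-live to each value. The client wrapper below is a hypothetical, in-memory sketch of using such an interface; the class, method names, and TTL handling are assumptions, not OpenDHT's actual RPC bindings.

```python
# Hypothetical sketch of a client for a shared put/get DHT service with
# per-value TTLs, in the spirit of a publicly shared storage interface.
import time

class SharedDHTClient:
    def __init__(self):
        self.store = {}                       # key -> list of (value, expiry)

    def put(self, key: bytes, value: bytes, ttl_seconds: int):
        # a TTL bounds how long a client's data occupies shared storage,
        # one simple lever for fair storage allocation
        self.store.setdefault(key, []).append((value, time.time() + ttl_seconds))

    def get(self, key: bytes):
        now = time.time()
        live = [(v, exp) for v, exp in self.store.get(key, []) if exp > now]
        self.store[key] = live                # drop expired values lazily
        return [v for v, _exp in live]

dht = SharedDHTClient()
dht.put(b"rendezvous/app1", b"host 10.1.2.3:4000", ttl_seconds=3600)
print(dht.get(b"rendezvous/app1"))
```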
Attempts to generalize the Internet's point-to-point communication abstraction to provide services like multicast, anycast, and mobility have faced challenging technical problems and deployment barriers. To ease the deployment of such services, this paper proposes an overlay-based Internet Indirection Infrastructure (I3) that offers a rendezvous-based communication abstraction. Instead of explicitly sending a packet to a destination, each packet is associated with an identifier; this identifier is then used by the receiver to obtain delivery of the packet. This level of indirection decouples the act of sending from the act of receiving, and allows I3 to efficiently support a wide variety of fundamental communication services. To demonstrate the feasibility of this approach, we have designed and built a prototype based on the Chord lookup protocol.
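The rendezvous abstraction can be sketched as receivers inserting triggers (identifier → address) and senders addressing packets to identifiers. The sketch below uses exact-match string identifiers and a plain dict; real I3 identifiers, inexact matching, and identifier stacks are omitted, so treat all names here as illustrative assumptions.

```python
# Minimal sketch of an I3-style rendezvous abstraction: receivers insert
# triggers mapping an identifier to their address, and senders send packets
# to identifiers rather than to hosts.

triggers = {}   # identifier -> list of receiver addresses

def insert_trigger(identifier: str, receiver_addr: str):
    triggers.setdefault(identifier, []).append(receiver_addr)

def deliver(addr, packet):
    print(f"deliver to {addr}: {packet!r}")

def send(identifier: str, packet: bytes):
    # the infrastructure, not the sender, resolves who currently receives
    for addr in triggers.get(identifier, []):
        deliver(addr, packet)

insert_trigger("video/channel42", "10.0.0.7:9000")   # receiver joins
insert_trigger("video/channel42", "10.0.0.9:9000")   # second receiver => multicast
send("video/channel42", b"frame-001")                # sender never names hosts
```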
Declarative Networking is a programming methodology that enables developers to concisely specify network protocols and services, which are directly compiled to a dataflow framework that executes the specifications. This paper provides an introduction to basic issues in declarative networking, including language design, optimization, and dataflow execution. We present the intuition behind declarative programming of networks, including roots in Datalog, extensions for networked environments, and the semantics of long-running queries over network state. We focus on a sublanguage we call Network Datalog (NDlog), including execution strategies that provide crisp eventual consistency semantics with significant flexibility in execution. We also describe a more general language called Overlog, which makes some compromises between expressive richness and semantic guarantees. We provide an overview of declarative network protocols, with a focus on routing protocols and overlay networks. Fin...
We explore the utility and execution of recursive queries as an interface for querying distributed network graph structures. To illustrate the power of recursive queries, we give several examples of computing structural properties of a P2P network such as reachability and resilience. To demonstrate the feasibility of our proposal, we sketch execution strategies for these queries using PIER, a P2P relational query processor over Distributed Hash Tables (DHTs). Finally, we discuss the relationship between in-network query processing and distance-vector-like routing protocols.
To meet the demands of new Internet applications, recent work argues for giving end-hosts more control over routing. To achieve this goal, we propose the use of a recursive query language, which allows users to define their own routing protocols. Recursive queries can be used to express a large variety of route requests such as the k shortest paths, shortest paths that avoid (or include) a given set of nodes, and least-loaded paths. We show that these queries can be efficiently implemented in the network, and that in the simple case when all users request shortest paths, the communication overhead of our solution is similar to that incurred by a distance vector protocol. In addition, when only a subset of nodes issue the same query, the communication cost can be further lowered using automatic query optimization techniques, suggesting that declarative queries and automatic optimization are important in this domain. Finally, we outline the main challenges faced by our proposal, focusing on its expressiveness and efficiency.
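The "shortest paths that avoid a given set of nodes" request mentioned above can be computed procedurally as a constrained shortest-path search. The sketch below runs Dijkstra over the graph minus the excluded nodes and stands in for the answer a recursive query engine would derive; the link table and function name are assumptions for illustration.

```python
# Sketch: "shortest path avoiding a set of nodes", the kind of route request
# the abstract expresses declaratively. Here it is computed directly with
# Dijkstra plus a filter predicate on the excluded nodes.
import heapq

def shortest_path_avoiding(links, src, dst, avoid):
    # links: dict node -> list of (neighbor, cost)
    dist, prev = {src: 0}, {}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v, c in links.get(u, []):
            if v in avoid:
                continue                       # the "avoid" constraint
            if d + c < dist.get(v, float("inf")):
                dist[v], prev[v] = d + c, u
                heapq.heappush(heap, (d + c, v))
    if dst not in dist:
        return None
    path, n = [dst], dst
    while n != src:
        n = prev[n]
        path.append(n)
    return list(reversed(path)), dist[dst]

links = {"a": [("b", 1), ("c", 5)], "b": [("d", 1)], "c": [("d", 1)], "d": []}
print(shortest_path_avoiding(links, "a", "d", avoid={"b"}))   # (['a', 'c', 'd'], 6)
```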
There have been recent proposals in the networking and distributed systems literature on declarative networking, where network protocols are declaratively specified using a recursive query language. This represents a significant new application area for recursive query processing technologies from databases. In this paper, we extend these recent proposals in the following ways. First, we motivate and formally define the NDlog language for declarative network specifications. We introduce the concept of link-restricted rules, which can be syntactically guaranteed to be executable via single-node derivations and message passing on an underlying network graph. Second, we introduce and prove correct relaxed versions of the traditional semi-naïve execution technique that overcome fundamental problems of semi-naïve evaluation in an asynchronous distributed setting. Third, we consider the dynamics of network state, and formalize the "eventual consistency" of our programs even when bursts of updates can arrive in the midst of query execution. Fourth, we present a number of query optimization opportunities that arise in the declarative networking context, including applications of traditional techniques and new optimizations. Last, we present evaluation results based on an implementation of the above ideas in the P2 declarative networking system, running on 100 machines over the Emulab network testbed.
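As a reminder of the baseline the paper relaxes, classic semi-naïve evaluation joins only the newly derived delta against the base relation in each round instead of re-deriving everything. The sketch below is a centralized illustration of that baseline under an assumed link table, not the relaxed distributed variants the paper introduces.

```python
# Centralized reminder of classic semi-naive evaluation for
#   path(S, D) :- link(S, D).
#   path(S, D) :- link(S, Z), path(Z, D).
# Only the latest delta is joined with link in each round.

def semi_naive_paths(link):
    path = set(link)          # rule 1
    delta = set(link)         # facts derived in the latest round
    while delta:
        new = {(s, d) for (s, z) in link for (z2, d) in delta if z == z2}
        delta = new - path    # only genuinely new facts feed the next round
        path |= delta
    return path

link = {("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")}
print(sorted(semi_naive_paths(link)))
```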
One of the key reasons overlay networks are seen as an excellent platform for large-scale distributed systems is their resilience in the presence of node failures. This resilience relies on accurate and timely detection of node failures. Despite the prevalent use of keep-alive algorithms in overlay networks to detect node failures, their tradeoffs and the circumstances in which they are best suited are not well understood. In this paper, we study how the design of various keep-alive approaches affects their performance in terms of node failure detection time, probability of false positives, control overhead, and packet loss rate via analysis, simulation, and implementation. We find that among the class of keep-alive algorithms that share information, the maintenance of backpointer state substantially improves detection time and packet loss rate. The improvement in detection time between baseline and sharing algorithms becomes more pronounced as the size of the neighbor set increases. Finally, sharing of information allows a network to tolerate a higher churn rate than the baseline algorithm.
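The baseline-versus-sharing distinction can be illustrated with a toy detector in which neighbors holding backpointers to the same node pool their missed probes, flagging a failure sooner than any single observer would on its own. The thresholds and structure below are made-up assumptions, not the algorithms evaluated in the paper.

```python
# Illustrative sketch of two keep-alive styles: a baseline probe (each node
# decides alone) versus a sharing variant in which observers with backpointer
# state pool their missed-probe counts.

BASELINE_MISSES = 3   # misses one node must see before declaring failure (assumed)
SHARED_MISSES = 3     # total misses across all observers in the sharing case (assumed)

def baseline_detects(missed_probes_per_observer):
    # each observer acts independently
    return any(m >= BASELINE_MISSES for m in missed_probes_per_observer)

def sharing_detects(missed_probes_per_observer):
    # observers holding backpointers to the same node exchange observations
    return sum(missed_probes_per_observer) >= SHARED_MISSES

observations = [1, 1, 1]          # three neighbors each missed one probe
print("baseline:", baseline_detects(observations))   # False: nobody is sure yet
print("sharing :", sharing_detects(observations))    # True: pooled evidence suffices
```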
Managing failures and configuring systems properly are of critical importance for robust distributed services. Unfortunately, protocols offering strong fault-tolerance guarantees are generally too costly and insensitive to performance criteria. Yet system management in practice is often ad hoc and ill-defined, leading to under-utilized capacity or adverse effects from poorly-behaving machines. This paper proposes a new abstraction called link-attestation groups (LA-Groups) for building robust distributed systems. Developers specify application-level correctness conditions or performance requirements for nodes. Nodes vouch for each other's acceptability within small groups of nodes through digitally-signed link attestations, and then apply a link-state protocol to determine these group relationships. By exposing such an attestation graph, LA-Groups enable the application (1) to make more informed decisions about its level of fault tolerance, security, or performance, and (2) to improve such properties by fluidly partitioning large-scale systems into small, better-suited groups. To demonstrate how LA-Groups can benefit systems, we outline designs for several applications (structured overlay routing, multicast, file sharing, and worm containment) that are robust against various failures.
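The attestation-graph idea can be pictured as nodes signing edges about peers they deem acceptable and the application then carving the graph into groups. The sketch below builds such a graph and extracts groups as connected components of mutual attestations, which is only one plausible grouping rule under assumed data; the LA-Groups design itself uses a link-state protocol and richer policies.

```python
# Illustrative sketch of an attestation graph: nodes issue statements
# ("I vouch for X") and the application partitions nodes into groups.
# Here a group is a connected component of *mutual* attestations.

attestations = {("a", "b"), ("b", "a"), ("b", "c"), ("c", "b"), ("d", "a")}

def mutual_edges(attestations):
    return {(x, y) for (x, y) in attestations if (y, x) in attestations}

def groups(nodes, attestations):
    edges = mutual_edges(attestations)
    adj = {n: {y for (x, y) in edges if x == n} for n in nodes}
    seen, result = set(), []
    for n in nodes:
        if n in seen:
            continue
        stack, comp = [n], set()
        while stack:                       # plain DFS over mutual links
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        seen |= comp
        result.append(sorted(comp))
    return result

print(groups(["a", "b", "c", "d"], attestations))   # [['a', 'b', 'c'], ['d']]
```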