Abdul Muttalib

Papers by Abdul Muttalib

Consensus in the presence of partial synchrony

Journal of The ACM, 1988

The concept of partial synchrony in a distributed system is introduced. Partial synchrony lies between the cases of a synchronous system and an asynchronous system. In a synchronous system, there is a known fixed upper bound Δ on the time required for a message to be sent from one processor to another and a known fixed upper bound Φ on the relative speeds of different processors. In an asynchronous system no fixed upper bounds Δ and Φ exist. In one version of partial synchrony, fixed bounds Δ and Φ exist, but they are not known a priori. The problem is to design protocols that work correctly in the partially synchronous system regardless of the actual values of the bounds Δ and Φ. In another version of partial synchrony, the bounds are known, but are only guaranteed to hold starting at some unknown time T, and protocols must be designed to work correctly regardless of when time T occurs. Fault-tolerant consensus protocols are given for various cases of partial synchrony and various fault models. Lower bounds that show in most cases that our protocols are optimal with respect to the number of faults tolerated are also given. Our consensus protocols for partially synchronous processors use new protocols for fault-tolerant "distributed clocks" that allow partially synchronous processors to reach some approximately common notion of time.

SIFT: Design and analysis of a fault-tolerant computer for aircraft control

Proceedings of The IEEE, 1978

SIFT (Software Implemented Fault Tolerance) is an ultrareliable computer for critical aircraft control applications that achieves fault tolerance by the replication of tasks among processing units. The main processing units are off-the-shelf minicomputers, with standard microcomputers serving as the interface to the I/O system. Fault isolation is achieved by using a specially designed redundant bus system to interconnect the processing units. Error detection and analysis and system reconfiguration are performed by software. Iterative tasks are redundantly executed, and the results of each iteration are voted upon before being used. Thus, any single failure in a processing unit or bus can be tolerated with triplication of tasks, and subsequent failures can be tolerated after reconfiguration. Independent execution by separate processors means that the processors need only be loosely synchronized, and a novel fault-tolerant synchronization method is described. The SIFT software is highly structured and is formally specified using the SRI-developed SPECIAL language. The correctness of SIFT is to be proved using a hierarchy of formal models. A Markov model is used both to analyze the reliability of the system and to serve as the formal requirement for the SIFT design. Axioms are given to characterize the high-level behavior of the system, from which a correctness statement has been proved. An engineering test version of SIFT is currently being built.
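The voting step described in the abstract above (each iteration's results are voted on before use, so triplication masks a single failure) can be illustrated with a small sketch. This is not SIFT's actual software; the function names `majority_vote` and `disagreeing_replicas` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence


def majority_vote(results: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value reported by a strict majority of replicas.

    Returns None when no value has a strict majority (hypothetical
    convention for this sketch, not taken from the SIFT paper).
    """
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) // 2 else None


def disagreeing_replicas(results: Sequence[Hashable], voted: Hashable) -> List[int]:
    """Indices of replicas whose result differs from the voted value.

    A system could use this list as input to error analysis and
    reconfiguration; that policy is not modeled here.
    """
    return [i for i, r in enumerate(results) if r != voted]


if __name__ == "__main__":
    # Three replicas run the same iteration; replica 2 is faulty.
    outputs = [42, 42, 17]
    voted = majority_vote(outputs)
    print(voted, disagreeing_replicas(outputs, voted))
```

With three replicas, one wrong result is outvoted by the two that agree, which is the sense in which triplication tolerates any single failure.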

Implementing fault-tolerant services using the state machine approach: a tutorial

ACM Computing Surveys, 1990

The state machine approach is a general method for implementing fault-tolerant services in distributed systems. This paper reviews the approach and describes protocols for two different failure models: Byzantine and fail-stop. System reconfiguration techniques for removing faulty components and integrating repaired components are also discussed.

Completeness theorems for non-cryptographic fault-tolerant distributed computation

Every function of n inputs can be efficiently computed by a complete network of n processors in s...

Bayeux: an architecture for scalable and fault-tolerant wide-area data dissemination

The demand for streaming multimedia applications is growing at an incredible rate. In this paper, we propose Bayeux, an efficient application-level multicast system that scales to arbitrarily large receiver groups while tolerating failures in routers and network links. Bayeux also includes specific mechanisms for load-balancing across replicate root nodes and more efficient bandwidth consumption. Our simulation results indicate that Bayeux maintains these properties while keeping transmission overhead low. To achieve these properties, Bayeux leverages the architecture of Tapestry, a fault-tolerant, wide-area overlay routing and location network. Each node N has a neighbor map with multiple levels, where each level represents a matching suffix up to a digit position in the ID. A given level of the neighbor map contains a number of entries equal to the base of the ID, where the i-th entry in the j-th level is the ID and location of the closest node which ends in "i" + suffix(N, j - 1). For example, the 9th entry of the 4th level for node 325AE is the node closest to 325AE in network distance which ends in 95AE.
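The neighbor-map rule at the end of the abstract above can be checked with a short sketch. It only computes the suffix that a given entry must match, reproducing the 325AE example from the text; it is not Tapestry's implementation, and the names `required_suffix` and `qualifies_for_entry` are hypothetical.

```python
def required_suffix(node_id: str, level: int, digit: str) -> str:
    """Suffix that the node stored in entry `digit` of `level` must end in.

    Levels are 1-based: level 1 matches on the last digit only, and each
    higher level shares one more trailing digit with `node_id`.
    """
    if not 1 <= level <= len(node_id):
        raise ValueError("level must be between 1 and the ID length")
    # The last (level - 1) digits of this node's own ID.
    shared = node_id[len(node_id) - (level - 1):]
    return digit + shared


def qualifies_for_entry(candidate_id: str, node_id: str, level: int, digit: str) -> bool:
    """True if `candidate_id` has the suffix required for that entry.

    Choosing the closest qualifying node by network distance is a
    separate step that this sketch does not model.
    """
    return candidate_id.endswith(required_suffix(node_id, level, digit))


if __name__ == "__main__":
    # Entry 9 of level 4 for node 325AE must end in "95AE".
    print(required_suffix("325AE", 4, "9"))
    print(qualifies_for_entry("195AE", "325AE", 4, "9"))
```

Each level extends the shared suffix by one digit, so a lookup that moves up one level per hop resolves one more digit of the destination ID each time.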

Fail-stop processors: an approach to designing fault-tolerant computing systems

ACM Transactions on Computer Systems, 1983

A methodology that facilitates the design of fault-tolerant computing systems is presented. It is...

Efficient dispersal of information for security, load balancing, and fault tolerance

Journal of The ACM, 1989

An Information Dispersal Algorithm (IDA) is developed that breaks a file F of length L ...

Reliable communication in the presence of failures

ACM Transactions on Computer Systems, 1985

The design and correctness of a communication facility for a distributed computer system are reported on. The facility provides support for fault-tolerant process groups in the form of a family of reliable multicast protocols that can be used in both local- and wide-area networks. These protocols attain high levels of concurrency, while respecting application-specific delivery ordering constraints, and have varying cost and performance that depend on the degree of ordering desired. In particular, a protocol that enforces causal delivery orderings is introduced and shown to be a valuable alternative to conventional asynchronous communication protocols. The facility also ensures that the processes belonging to a fault-tolerant process group will observe consistent orderings of events affecting the group as a whole, including process failures, recoveries, migration, and dynamic changes to group properties like member rankings. A review of several uses for the protocols in the ISIS system, which supports fault-tolerant resilient objects and bulletin boards, illustrates the significant simplification of higher level algorithms made possible by our approach.

System Structure for Software Fault Tolerance

IEEE Transactions on Software Engineering, 1975

This paper presents and discusses the rationale behind a method for structuring complex computing systems by the use of what we term "recovery blocks," "conversations," and "fault-tolerant interfaces." The aim is to facilitate the provision of dependable error detection and recovery facilities which can cope with errors caused by residual design inadequacies, particularly in the system software, rather than merely the occasional malfunctioning of hardware components.

Tapestry: An Infrastructure for Fault-tolerant Wide-area Location and Routing

In today's chaotic network, data and services are mobile and replicated widely for availability, durability, and locality. Components within this infrastructure interact in rich and complex ways, greatly stressing traditional approaches to name service and routing. This paper explores an alternative to traditional approaches called Tapestry. Tapestry is an overlay location and routing infrastructure that provides location-independent routing of messages directly to the closest copy of an object or service using only point-to-point links and without centralized resources. The routing and directory information within this infrastructure is purely soft state and easily repaired. Tapestry is self-administering, fault-tolerant, and resilient under load. This paper presents the architecture and algorithms of Tapestry and explores their advantages through a number of experiments.

A distributed trust model

The widespread use of the Internet signals the need for a better understanding of trust as a basi...
