
BASE: Using abstraction to improve fault tolerance

2003, ACM Transactions on Computer Systems

Software errors are a major cause of outages and they are increasingly exploited in malicious attacks. Byzantine fault tolerance allows replicated systems to mask some software errors but it is expensive to deploy. This paper describes a replication technique, BASE, which uses abstraction to reduce the cost of Byzantine fault tolerance and to improve its ability to mask software errors. BASE reduces cost because it enables reuse of off-the-shelf service implementations. It improves availability because each replica can be repaired periodically using an abstract view of the state stored by correct replicas, and because each replica can run distinct or nondeterministic service implementations, which reduces the probability of common mode failures. We built an NFS service where each replica can run a different off-the-shelf file system implementation, and an object-oriented database where the replicas ran the same, nondeterministic implementation. These examples suggest that our technique can be used in practice: in both cases, the implementation required only a modest amount of new code, and our performance results indicate that the replicated services perform comparably to the implementations that they reuse.

MIGUEL CASTRO, Microsoft Research
RODRIGO RODRIGUES and BARBARA LISKOV, MIT Laboratory for Computer Science
Categories and Subject Descriptors: C.2.0 [Computer-Communication Networks]: General—Security and protection; C.2.4 [Computer-Communication Networks]: Distributed Systems—Client/server; D.4.3 [Operating Systems]: File Systems Management; D.4.5 [Operating Systems]: Reliability—Fault tolerance; D.4.6 [Operating Systems]: Security and Protection—Access controls; authentication; cryptographic controls; D.4.8 [Operating Systems]: Performance—Measurements; H.2.0 [Database Management]: General—Security, integrity, and protection; H.2.4 [Database Management]: Systems—Object-oriented databases

General Terms: Security, Reliability, Algorithms, Performance, Measurement

Additional Key Words and Phrases: Byzantine fault tolerance, state machine replication, proactive recovery, asynchronous systems, N-version programming

This research was partially supported by DARPA under contract F30602-98-1-0237 monitored by the Air Force Research Laboratory. Rodrigo Rodrigues was partially funded by a Praxis XXI fellowship.

Authors' addresses: M. Castro, Microsoft Research, 7 J. J. Thomson Avenue, Cambridge CB3 0FB, UK; email: [email protected]; R. Rodrigues and B. Liskov, MIT Laboratory for Computer Science, 545 Technology Sq., Cambridge, MA 02139; email: {rodrigo,liskov}@lcs.mit.edu.
ACM Transactions on Computer Systems, Vol. 21, No. 3, August 2003, Pages 236–269.

1. INTRODUCTION

There is a growing demand for highly-available systems that provide correct service without interruptions. These systems must tolerate software errors because they are a major cause of outages [Gray and Siewiorek 1991]. Furthermore, there is an increasing number of malicious attacks that exploit software errors to gain control or deny access to systems that provide important services. This paper proposes a replication technique, BASE, that combines Byzantine fault tolerance [Pease et al. 1980] with work on data abstraction [Liskov and Guttag 2000]. Byzantine fault tolerance allows a replicated service to tolerate arbitrary behavior from faulty replicas—behavior caused by a software bug or an attack. Abstraction hides implementation details to enable the reuse of off-the-shelf implementations of important services (e.g., file systems, databases, or HTTP daemons) and to improve the ability to mask software errors.

We extended the BFT library [Castro and Liskov 1999, 2000, 2002] to implement BASE. (BASE is an acronym for BFT with Abstract Specification Encapsulation.) The original BFT library provides Byzantine fault tolerance with good performance and strong correctness guarantees if no more than one-third of the replicas fail within a small window of vulnerability. However, it requires all replicas to run the same service implementation and to update their state in a deterministic way.
Therefore, it cannot tolerate deterministic software errors that cause all replicas to fail concurrently, and it complicates reuse of existing service implementations because it requires extensive modifications to ensure identical values for the state of each replica. The BASE library and methodology described in this paper correct these problems—they enable replicas to run different or nondeterministic implementations. The methodology is based on the concepts of abstract specification and abstraction function from work on data abstraction [Liskov and Guttag 2000]. We start by defining a common abstract specification for the service, which specifies an abstract state and describes how each operation manipulates the state. Then we implement a conformance wrapper for each distinct implementation to make it behave according to the common specification. The last step is to implement an abstraction function (and one of its inverses) to map from the concrete state of each implementation to the common abstract state (and vice versa). The methodology offers several important advantages:

—Reuse of existing code. BASE implements a form of state machine replication [Lamport 1978; Schneider 1990] that allows replication of services that perform arbitrary computations, but requires determinism: all replicas must produce the same sequence of results when they process the same sequence of operations. Most off-the-shelf implementations of services fail to satisfy this condition. For example, many implementations produce timestamps by reading local clocks, which can cause the states of replicas to diverge. The conformance wrapper and the abstract state conversions enable the reuse of existing implementations. Furthermore, these implementations can be nondeterministic, which reduces the probability of common mode failures.

—Software rejuvenation through proactive recovery.
It has been observed [Huang et al. 1995] that there is a correlation between the length of time software runs and the probability that it fails. BASE combines proactive recovery [Castro and Liskov 2000] with abstraction to counter this problem. Replicas are recovered periodically even if there is no reason to suspect they are faulty. Recoveries are staggered so that the service remains available during rejuvenation to enable frequent recoveries. When a replica is recovered, it is rebooted and restarted from a clean state. Then it is brought up to date using a correct copy of the abstract state that is obtained from the group of replicas. Abstraction may improve availability by hiding corrupt concrete states, and it enables proactive recovery when replicas do not run the same code or run code that is nondeterministic.

—Opportunistic N-version programming. Replication is not useful when there is a strong positive correlation between the failure probabilities of the different replicas; for example, deterministic software bugs cause all replicas to fail at the same time when they run the same code. N-version programming [Chen and Avizienis 1978] exploits design diversity to reduce the probability of correlated failures, but it has several problems [Gray and Siewiorek 1991]: it increases development and maintenance costs by a factor of N or more, adds unacceptable time delays to the implementation, and does not provide a mechanism to repair faulty replicas. BASE enables an opportunistic form of N-version programming by allowing us to take advantage of distinct, off-the-shelf implementations of common services. This approach overcomes the defects mentioned above: it eliminates the high development and maintenance costs of N-version programming, and also the long time-to-market. Additionally, we can repair faulty replicas by transferring an encoding of the common abstract state from correct replicas.
Opportunistic N-version programming may be a viable option for many common services, for example, relational databases, HTTP daemons, file systems, and operating systems. In all these cases, competition has led to four or more distinct implementations that were developed and are maintained separately but have similar (although not identical) functionality. Since each off-the-shelf implementation is sold to a large number of customers, the vendors can amortize the cost of producing a high-quality implementation. Furthermore, the existence of standard protocols that provide identical interfaces to different implementations, e.g., ODBC [Geiger 1995] and NFS [RFC-1094 1989], simplifies our technique and keeps the cost of writing the conformance wrappers and state conversion functions low. We can also leverage the effort towards standardizing data representations using XML [W3C 2000].

This paper explains the methodology by giving two examples: a replicated file service, where replicas run different operating systems and file systems, and a replicated object-oriented database, where the replicas run the same implementation but the implementation is nondeterministic. The paper also provides an evaluation of the methodology based on these examples; we evaluate the complexity of the conformance wrapper and state conversion functions and the overhead they introduce.

The remainder of the paper is organized as follows. Section 2 describes our methodology and the BASE library. Section 3 explains how we applied the methodology to build the replicated file system and object-oriented database. We evaluate our technique in Section 4. Section 5 discusses related work and Section 6 presents our conclusions.

2. THE BASE TECHNIQUE

This section provides an overview of our replication technique.
It starts by describing the methodology that we use to build a replicated system from existing service implementations. Then it describes the replication algorithm that we use, and it ends with a description of the BASE library.

2.1 Methodology

The goal is to build a replicated system by reusing a set of off-the-shelf implementations, I1, ..., In, of some service. Ideally, we would like n to equal the number of replicas so that each replica can run a different implementation to reduce the probability of simultaneous failures. But the technique is useful even with a single implementation. Although off-the-shelf implementations of the same service offer roughly the same functionality, they behave differently: they implement different specifications, S1, ..., Sn, using different representations of the service state. Even the behavior of different replicas that run the same implementation may be different when the specification they implement is not strong enough to ensure deterministic behavior. For example, the NFS specification [RFC-1094 1989] allows implementations to choose the value of file handles arbitrarily.

BASE, like any form of state machine replication, requires determinism: replicas must produce the same sequence of results when they execute the same sequence of operations. We achieve determinism by defining a common abstract specification, S, for the service that is strong enough to ensure deterministic behavior. This specification defines the abstract state, an initial state value, and the behavior of each service operation. The specification is defined without knowledge of the internals of each implementation. It is sufficient to treat them as black boxes, which is important to enable the use of existing implementations. Additionally, the abstract state captures only what is visible to the client rather than mimicking what is common in the concrete states of the different implementations.
This simplifies the abstract state and improves the effectiveness of our software rejuvenation technique.

The next step is to implement conformance wrappers, C1, ..., Cn, for each of I1, ..., In. The conformance wrappers implement the common specification S. The implementation of each wrapper Ci is a veneer that invokes the operations offered by Ii to implement the operations in S; in implementing these operations, this veneer makes use of a conformance representation that stores whatever additional information is needed to allow the translation from the concrete behavior of the implementation to the abstract behavior. The conformance wrapper also implements some additional methods that allow a replica to be shut down and then restarted without loss of information.

The final step is to implement the abstraction function and one of its inverses. These functions allow state transfer among the replicas. State transfer is used to repair faulty replicas, and also to bring slow replicas up to date when messages they are missing have been garbage collected. For state transfer to work, replicas must agree on the value of the state of the service after executing a sequence of operations; they will not agree on the value of the concrete state, but our methodology ensures that they will agree on the value of the abstract state. The abstraction function is used to convert the concrete state stored by a replica into the abstract state, which is transferred to another replica. The receiving replica uses the inverse function to convert the abstract state into its own concrete state representation. To enable efficient state transfer between replicas, the abstract state is defined as an array of objects. The array has a fixed maximum size, but the objects it contains can vary in size. We explain how this representation enables efficient state transfer in Section 2.3.
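To make the state-conversion step concrete, here is a minimal sketch (all names are hypothetical, with a Python dict standing in for an off-the-shelf service; the actual BASE library is not structured this way). The wrapper exposes the service state as a fixed-size array of variable-size abstract objects, with an abstraction function (get_obj) and an inverse (put_objs) translating in each direction:

```python
# Illustrative sketch only: hypothetical names, toy key-value "service".

class TinyKVService:
    """Stand-in for a reused, off-the-shelf implementation."""
    def __init__(self):
        self.concrete = {}                 # concrete state: key -> value

class ConformanceWrapper:
    MAX_OBJS = 1024                        # fixed maximum array size

    def __init__(self, impl):
        self.impl = impl
        self.key_of = {}                   # conformance representation:
                                           # abstract index -> concrete key

    def get_obj(self, i):
        """Abstraction function: abstract object i, encoded as bytes."""
        key = self.key_of.get(i)
        if key is None:
            return b""                     # unused slot: empty object
        return ("%s=%s" % (key, self.impl.concrete[key])).encode()

    def put_objs(self, objs):
        """Inverse: install a consistent set of (index, value) pairs."""
        for i, data in objs:
            assert i < self.MAX_OBJS
            if not data:
                continue
            key, value = data.decode().split("=", 1)
            self.impl.concrete[key] = value
            self.key_of[i] = key
```

In this sketch, state transfer would move only abstract objects; the receiving replica's put_objs rebuilds its own concrete representation, whatever form that takes.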
There is an important trend that makes it easier to apply the methodology. Market forces pressure vendors to offer interfaces that are compliant with standard specifications for interoperability, for example, ODBC [Geiger 1995]. Usually, a standard specification S' cannot be used as the common specification S because it is too weak to ensure deterministic behavior. But it can be used as a basis for S and, because S and S' are similar, it is relatively easy to implement conformance wrappers and state conversion functions, and these implementations can be reused across implementations. This is illustrated by the replicated file system example in Section 3. In this example, we take advantage of the NFS standard by using the same conformance wrapper and state conversion functions to wrap different implementations.

2.2 The BFT Algorithm

BFT [Castro and Liskov 1999, 2000, 2002] is an algorithm for state machine replication [Lamport 1978; Schneider 1990] that offers both liveness and safety provided at most ⌊(n−1)/3⌋ out of a total of n replicas are faulty. This means that clients eventually receive replies to their requests and those replies are correct according to linearizability [Herlihy and Wing 1987; Castro 2000]. BFT is safe in asynchronous systems like the Internet: it does not rely on any synchrony assumption to provide safety. In particular, it never returns bad replies even in the presence of denial-of-service attacks. Additionally, it guarantees liveness provided message delays are bounded eventually. The service may be unable to return replies when a denial-of-service attack is active, but clients are guaranteed to receive replies when the attack ends.

Since BFT is a state-machine replication algorithm, it has the ability to replicate services with complex operations. This is an important defense against Byzantine-faulty clients: operations can be designed to preserve invariants on the service state, to offer narrow interfaces, and to perform access control.
BFT provides safety regardless of the number of faulty clients, and the safety property ensures that faulty clients are unable to break these invariants or bypass access controls. There is also a proactive recovery mechanism for BFT that recovers replicas periodically even if there is no reason to suspect that they are faulty. This allows the replicated system to tolerate any number of faults over the lifetime of the system provided fewer than one-third of the replicas become faulty within a window of vulnerability.

The basic idea in BFT is simple. Clients send requests to execute operations and all nonfaulty replicas execute the same operations in the same order. Since replicas are deterministic and start in the same state, all nonfaulty replicas send replies with identical results for each operation. The client chooses the result that appears in at least f + 1 replies. The hard problem is ensuring nonfaulty replicas execute the same requests in the same order. BFT uses a combination of primary-backup and quorum replication techniques to order requests. Replicas move through a succession of numbered configurations called views. In a view, one replica is the primary and the others are backups. The primary picks the execution order by proposing a sequence number for each request. Since the primary may be faulty, the backups check the sequence numbers and trigger view changes to select a new primary when it appears that the current one has failed.

BFT incorporates a number of important optimizations that allow the algorithm to perform well so that it can be used in practice. The most important optimization is the use of symmetric cryptography to authenticate messages. Public-key cryptography is used only to exchange the symmetric keys.
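The fault-threshold arithmetic above can be sketched in a few lines (illustrative only): with n replicas, BFT tolerates f = ⌊(n − 1)/3⌋ faults, and a client accepts a result once it appears in f + 1 matching replies, since at least one of those must come from a correct replica.

```python
def max_faults(n):
    """Largest f such that n >= 3f + 1."""
    return (n - 1) // 3

def choose_result(replies, n):
    """Client side: return a result vouched for by at least f + 1
    replies, or None if no result has enough matching votes yet."""
    needed = max_faults(n) + 1
    counts = {}
    for r in replies:
        counts[r] = counts.get(r, 0) + 1
        if counts[r] >= needed:
            return r
    return None
```

For example, with n = 4 replicas (f = 1), two matching replies suffice even if the remaining replies are missing or wrong.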
Other optimizations reduce the communication overhead: the algorithm uses only one message round trip to execute read-only operations and two to execute read-write operations, and it uses batching under load to amortize the protocol overhead for read-write operations over many requests. The algorithm also uses IP multicast and other optimizations to reduce protocol overhead as the operation argument and result sizes increase. Additionally, it provides efficient techniques to garbage collect protocol information, and to transfer state to bring replicas up to date.

BFT has been implemented as a generic program library with a simple interface. The BFT library can be used to provide Byzantine-fault-tolerant versions of different services. It has been used to implement a Byzantine-fault-tolerant distributed file system, BFS, which supports the NFS protocol. A detailed performance evaluation of the BFT library and BFS appears in Castro [2000] and Castro and Liskov [2002]. This evaluation includes results of micro-benchmarks to measure performance in the normal case, during view changes, and during state transfers. It also includes experiments to evaluate the impact of each optimization used by BFT. Furthermore, these experimental results are backed by an analytical performance model [Castro 2000].

2.3 The BASE Library

The BASE library extends BFT with the features necessary to support the methodology. The BFT library requires all replicas to run the same service implementation and to update their state in a deterministic way, which complicates reuse of existing service implementations because it requires extensive modifications to ensure identical values for the state of each replica. The BASE library corrects these problems. Figure 1 presents its interface.

Fig. 1. BASE interface and upcalls.
The invoke procedure is called by the client to invoke an operation on the replicated service. This procedure carries out the client side of the replication protocol and returns the result when enough replicas have responded. When the library needs to execute an operation at a replica, it makes an upcall to an execute procedure that is implemented by the conformance wrapper for the service implementation run by the replica.

State transfer. To perform state transfer in the presence of Byzantine faults, it is necessary to be able to prove that the state being transferred is correct. Otherwise, faulty replicas could corrupt the state of out-of-date but correct replicas. (A detailed discussion of this point can be found in Castro and Liskov [2000].) Consequently, replicas cannot discard a copy of the state produced after executing a request until they know that the state produced by executing later requests can be proven correct. Replicas could keep a copy of the state after executing each request, but this would be too expensive. Instead, replicas keep just the current version of the concrete state plus copies of the abstract state produced every kth request (e.g., k = 128). These copies are called checkpoints. Replicas inform each other when they produce a checkpoint, and the library only transfers checkpoints between replicas.

Creating checkpoints by making full copies of the abstract state would be too expensive. Instead, the library uses copy-on-write such that checkpoints only contain the differences relative to the current abstract state. Similarly, transferring a complete checkpoint to bring a recovering or out-of-date replica up to date would be too expensive. The library employs a hierarchical state partition scheme to transfer state efficiently.
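As a toy illustration of a hierarchical partition scheme of this kind (two levels only; the layout, hash choice, and names here are assumptions, not BASE's actual structure), a replica can compare per-partition digests first and hash individual objects only inside partitions whose digests differ:

```python
import hashlib

def digest(data):
    """Cryptographic hash of a byte string."""
    return hashlib.sha256(data).hexdigest()

def partition_digests(objs, width):
    """Group leaf objects into fixed-width partitions, one digest each."""
    parts = []
    for start in range(0, len(objs), width):
        parts.append(digest(b"".join(objs[start:start + width])))
    return parts

def objects_to_fetch(local, remote, width):
    """Indices of leaf objects that differ, examining leaves only
    inside partitions whose digests do not match."""
    stale = []
    lp = partition_digests(local, width)
    rp = partition_digests(remote, width)
    for p, (lh, rh) in enumerate(zip(lp, rp)):
        if lh == rh:
            continue                     # whole partition is up to date
        for i in range(p * width, min((p + 1) * width, len(local))):
            if digest(local[i]) != digest(remote[i]):
                stale.append(i)
    return stale
```

The point of the hierarchy is that an up-to-date partition is dismissed with a single digest comparison, so only a small fraction of the objects are ever hashed or fetched.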
When a replica is fetching state, it recurses down a hierarchy of meta-data to determine which partitions are out of date. When it reaches the leaves of the hierarchy (which are the abstract objects), it fetches only the objects that are corrupt or out of date. This is described in detail in Castro and Liskov [2000].

As mentioned earlier, to implement checkpointing and state transfer efficiently, we require that the abstract state be encoded as an array of objects, where the objects can have variable size. This representation allows state transfer to be done on just those objects that are out of date or corrupt. The current implementation of the BASE library requires the array to have a fixed size. This limits flexibility in the definition of encodings for the abstract state, but it is not an intrinsic problem. The maximum number of entries in the array can be set to an extremely large value without allocating extra space for the portion of the array that is not used, and without degrading the performance of state transfer and checking.

To implement state transfer, each replica must provide the library with two upcalls, which implement the abstraction function and one of its inverses. These upcalls do not convert the entire state each time they are called because this would be too expensive. Instead, they perform conversions at the granularity of an object in the abstract state array. The abstraction function is implemented by get_obj. It receives an object index i, allocates a buffer, obtains the value of the abstract object with index i, and places that value in the buffer. It returns the size of that object and a pointer to the buffer. The inverse abstraction function receives a new abstract state value and updates the concrete state to match this argument. This function should also work incrementally to achieve good performance.
But it cannot process just one abstract object per invocation because there may be invariants on the abstract state that create dependencies between objects. For example, suppose that an object in the abstract state of a file system can be either a file or a directory. If a slow replica misses the operations that create a directory, d, and a file, f, in d, it has to fetch the abstract objects corresponding to d and f from the others. Then, it invokes the inverse abstraction function to bring its concrete state up to date. If f is the argument to the first invocation and d is the argument to the second, it is impossible for the first invocation to update the concrete state because it has no information on where to create the file. The reverse order does not work either because the first invocation creates a dangling reference in d. To solve this problem, put_objs receives a vector of objects with the corresponding sizes and indices in the abstract state array. The library guarantees that this upcall is invoked with an argument that brings the abstract state of the replica to a consistent value (i.e., the value of a valid checkpoint).

Each time the execute upcall is about to modify an object in the abstract state, it is required to invoke a modify procedure, which is supplied by the library, passing the object index as argument. This is used to implement copy-on-write to create checkpoints incrementally. When modify is invoked, the library checks if it has saved a copy of the object since the last checkpoint was taken. If it has not, it calls get_obj with the appropriate index and saves the copy of the object until the corresponding checkpoint can be discarded. It may be difficult to identify which parts of the service state will be modified by an operation before it runs.
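The copy-on-write bookkeeping around modify can be sketched as follows (hypothetical names; a simplification of what the library does internally): before execute mutates abstract object i, modify records its pre-image, so the value each object had at the last checkpoint stays available until that checkpoint is discarded.

```python
class Checkpointer:
    def __init__(self, get_obj):
        self.get_obj = get_obj      # abstraction-function upcall
        self.saved = {}             # index -> value at the last checkpoint

    def modify(self, i):
        """Must be called by execute before mutating abstract object i."""
        if i not in self.saved:     # first modification since checkpoint
            self.saved[i] = self.get_obj(i)

    def checkpoint_value(self, i):
        """Value object i had at the last checkpoint: its saved pre-image
        if it was modified since, otherwise its current value."""
        return self.saved.get(i, self.get_obj(i))

    def take_checkpoint(self):
        """The current state becomes the new checkpoint; drop old copies."""
        self.saved = {}
```

Only modified objects are ever copied, which is what makes checkpoints cheap when most of the state is untouched between them.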
Some services provide support to determine which objects are modified by an operation and to access their previous state, for example, relational databases. For other services, it is always possible to use coarse-grained objects and to call modify conservatively; this makes it simpler to write correct code at the expense of some performance degradation.

Non-determinism. BASE implements a form of state machine replication that requires replicas to behave deterministically. Our methodology uses abstraction to hide most of the nondeterminism in the implementations it reuses. However, many services involve forms of nondeterminism that cannot be hidden by abstraction. For instance, in the case of the NFS service, the time-last-modified for each file is set by reading the server's local clock. If this were done independently at each replica, the states of the replicas would diverge. Instead, we allow the primary replica to propose values for nondeterministic choices by providing the propose_value upcall, which is invoked only at the primary. The call receives the client request and the sequence number for that request; it selects a nondeterministic value and puts it in non_det. This value is then supplied as an argument of the execute upcall to all replicas.

The protocol implemented by the BASE library prevents a faulty primary from causing replica state to diverge by sending different values to different backups. However, a faulty primary might send the same, incorrect value to all backups, subverting the system's desired behavior. The solution to this problem is to have each replica implement a check_value function that validates the choice of nondeterministic values made by the primary. If one-third or more nonfaulty replicas reject a value proposed by a faulty primary, the request will not be executed and the view change mechanism will cause the primary to be replaced soon after.

Recovery.
Proactive recovery periodically restarts each replica from a correct, up-to-date checkpoint of the abstract state that is obtained from the other replicas. Recoveries are triggered by a watchdog timer. When this timer fires, the replica reboots after saving to disk the abstract service state and the replication protocol state, which includes abstract objects that were copied by the incremental checkpointing mechanism. The library could invoke get_obj repeatedly to save a complete copy of the abstract state to disk, but this would be expensive. It is sufficient to ensure that the current concrete state is on disk and to save a small amount of additional information to enable reconstruction of the conformance representation when the replica restarts. Since the library does not have access to this representation, the service state is saved to a file by an additional upcall, shutdown, that is implemented by the conformance wrapper. The conformance wrapper also implements a restart upcall that is invoked to reconstruct the conformance representation from the file saved by shutdown and from the concrete state of the service. This enables the replica to compute the abstract state by calling get_obj after restart completes.

Fig. 2. BASE function calls and upcalls.

In some cases, the information in the conformance representation is volatile; it is no longer valid when the replica restarts. In this case, it is necessary to augment it with information that is persistent and allows restart to reconstruct the conformance representation after a reboot.

After calling restart, the library uses the hierarchical state transfer mechanism to compare the value of the abstract state of the replica with the abstract state values stored by the other replicas.
It computes cryptographic hashes of the abstract objects and compares them with the hashes in the state partition tree to check if the objects are corrupt. The state partition tree also contains the sequence number of the last checkpoint at which each object was modified [Castro and Liskov 2000]. The replica uses this information to check which objects are out of date without having to compute their hash. These checks are performed in parallel with fetches of objects that have already been determined to be out of date or corrupt. This is efficient: the replica fetches only the value of objects that are out of date or corrupt. We use a single-threaded implementation with event queues representing objects to fetch and objects to check. Checks are performed while waiting for replies to fetch requests. The replica does not execute operations until it completes the recovery, but the other replicas continue to process requests [Castro and Liskov 2000].

Proactive recovery allows automatic recovery of faulty replicas from many failures. However, hardware failures may prevent replicas from recovering automatically. In this case, administrators should repair or replace the faulty replica and then trigger the recovery mechanism to bring it up to date.

The object values fetched by the replica could be supplied to put_objs to update the concrete state, but the concrete state might still be corrupt. For example, an implementation may have a memory leak, and simply calling put_objs will not free unreferenced memory. In fact, implementations will not typically offer an interface that can be used to fix all corrupt data structures in their concrete state. Therefore, it is better to restart the implementation from a clean initial concrete state and use the abstract state to bring it up to date. A global view of all BASE functions and upcalls that are invoked is shown in Figure 2.
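To close this section, the propose_value/check_value interplay described above for NFS timestamps can be sketched as follows (the skew bound and all names are assumptions for illustration, not the library's actual code): the primary proposes its clock reading, and each replica accepts it only if it is fresh and monotonically increasing.

```python
import time

MAX_SKEW = 2.0   # tolerated clock skew in seconds (an assumed bound)

def propose_value():
    """Primary only: pick the nondeterministic value for this request."""
    return time.time()

def check_value(proposed, last_accepted, local_clock):
    """Every replica: validate the primary's choice. Reject values that
    go backwards or stray too far from the local clock."""
    fresh = abs(proposed - local_clock) <= MAX_SKEW
    monotonic = proposed > last_accepted
    return fresh and monotonic
```

A faulty primary can still bias timestamps within the skew bound, but it cannot make replica states diverge or push time arbitrarily far off.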
2.4 Limitations

In theory, the methodology can be used to build a replicated service from any set of existing implementations of any service. But sometimes this may be hard because of the following problems.

Undocumented behavior. To apply the methodology, we need to understand and model the behavior of each service implementation. We do not need to model low-level implementation details but only the behavior that can be observed by the clients of that implementation. We believe that the behavior of most software is well documented at this level, and we can use black-box testing to understand small omissions in the documentation and small deviations from documented behavior. Implementations whose behavior we cannot model are unlikely to be of much use. It may be possible to remove operations whose behavior is not well documented from the abstract specification, or to implement these operations entirely in the conformance wrapper.

Very different behavior. If the implementations used to build the service behave very differently, any common abstract specification will deviate significantly from the behavior of some implementations. Theoretically, it is possible to write arbitrarily complex conformance wrappers and state conversion functions to bridge the gap between the behavior of the different implementations and the common abstract specification. In the worst case, we could implement the entire abstract specification in the wrapper code. But this is not practical because it is expensive to write complex wrappers, and complex wrappers are more likely to introduce new bugs. Therefore, it is important to use a set of implementations with similar behavior.

Narrow interfaces. The external interface of some implementations may not allow the wrapping code to read or write data that has an impact on the behavior observed by the client.
For example, databases do not usually provide interfaces to manipulate concurrency control state, which influences observable behavior. There are three options in this case. First, the data can be shadowed in the conformance wrapper. This is practical if it is a small amount of data that is simple to maintain. Second, it may be possible to change the abstract specification so that this data has no impact on the behavior observed by the client. Third, it may be possible to gain access to internal APIs that avoid the problem.

Concurrency. It is common for service implementations to process requests concurrently to improve performance. Additionally, these implementations can provide different consistency guarantees for concurrent execution. For example, databases provide different degrees of isolation [Gray and Reuter 1993] to allow applications to trade off consistency for performance. Concurrent request execution results in a form of nondeterminism that is visible to the clients of the service and, therefore, needs to be constrained to apply our replication methodology (or any other form of state machine replication as discussed, for example, in Kemme and Alonso [2000]; Amir et al. [2002]; Jiménez-Peris et al. [2002]). It is nontrivial to ensure deterministic behavior without degrading performance. There are two basic approaches to solve this problem: implementing concurrency control in the conformance wrapper, or modifying the concurrency control code in the service implementation. The conformance wrapper can implement concurrency control by determining which requests conflict and by not issuing a request to the service if it conflicts with a request that has a smaller sequence number and has not yet completed.
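One way to visualize the rule above is to assign requests, in sequence-number order, to rounds such that a request never runs before an earlier conflicting one. The batch-based formulation and the names below are our illustration; the actual wrapper simply delays a conflicting request until its predecessors complete.

```python
def schedule(requests, conflicts):
    """requests: list of (seqno, op) pairs; conflicts(a, b): do two ops conflict?
    Returns rounds of ops that may execute concurrently, preserving the rule
    that a request waits for every earlier conflicting request."""
    pending = sorted(requests)          # process in sequence-number order
    rounds = []
    while pending:
        batch, rest = [], []
        for s, op in pending:
            # an op may run now only if no earlier, not-yet-executed op conflicts
            if any(conflicts(prev, op) for _, prev in batch + rest):
                rest.append((s, op))
            else:
                batch.append((s, op))
        rounds.append([op for _, op in batch])
        pending = rest
    return rounds
```

With the conservative policy mentioned below (all requests conflict), every round contains exactly one request, which degenerates to fully sequential execution.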
Wrapper-level concurrency control treats the underlying service implementation as a black box, and it works even when the service provides weak consistency guarantees for concurrent request execution or when different implementations provide different guarantees. However, determining conflicts before executing requests may be hard in some services. It is easy in the two examples presented in this article: file systems and client-server databases that ship transaction read and write sets back to the server. But it is harder in other systems, for example, relational databases whose servers receive transactions with arbitrary sequences of SQL statements. The wrapper can conservatively assume that all requests conflict, which is simple to implement and solves the problem, but can result in poor performance. The work in Amir et al. [2002] and Jiménez-Peris et al. [2002] issues requests to a database one at a time to ensure determinism. Alternatively, the concurrency control code in the service implementation can be modified to ensure that conflicting requests are serialized in order of increasing sequence number. This has the disadvantage of requiring nontrivial changes to the service implementation, and it does not work with weak consistency guarantees. The work in Kemme and Alonso [2000] describes how to modify a relational database to achieve something similar.

3. EXAMPLES

This section uses two examples to illustrate the methodology: a replicated file system and an object-oriented database.

3.1 File System

The file system is based on the NFS protocol [RFC-1094 1989]. Its replicas can run different operating systems and file system implementations. This allows them to tolerate software errors not only in the file system implementation but also in the rest of the operating system.

3.1.1 Abstract Specification. The common abstract specification is based on the specification of version 2 of the NFS protocol [RFC-1094 1989].
The abstract file service state consists of a fixed-size array of pairs containing an object and a generation number. Each object has a unique identifier, oid, which is obtained by concatenating its index in the array and its generation number. The generation number is incremented every time the entry is assigned to a new object. There are four types of objects:

— files, whose data is a byte array with the file contents;
— directories, whose data is a sequence of <name, oid> pairs ordered lexicographically by name;
— symbolic links, whose data is a small character string;
— special null objects, which indicate that an entry is free.

In addition to data, all non-null objects have meta-data, which includes the attributes in the NFS fattr structure and the index (in the array) of the object's parent directory. Each entry in the array is encoded using XDR [RFC-1014 1987]. The object with index 0 is a directory object that corresponds to the root of the file system tree that was mounted. Keeping a pointer to the parent directory is redundant, since we can derive this information by scanning the rest of the abstract state. But it simplifies the inverse abstraction function and the recovery algorithm, as we will explain later.

Fig. 3. Software architecture.

The operations in the common specification are those defined by the NFS protocol. There are operations to read and write each type of non-null object. The file handles used by the clients are the oids of the corresponding objects. To ensure deterministic behavior, we require that oids be assigned deterministically and that directory entries returned to a client be ordered lexicographically. Some errors in the NFS protocol depend on the environment; for example, NFSERR_NOSPC is returned when the disk is full. The common abstract specification virtualizes the environment to ensure deterministic error processing.
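The oid scheme can be sketched in a few lines. The 16/16 bit split below is our assumption for illustration; the specification only requires that an oid concatenate the array index and the generation number.

```python
GEN_BITS = 16  # illustrative split of the oid's bits

def make_oid(index, generation):
    """oid = array index concatenated with the entry's generation number."""
    return (index << GEN_BITS) | generation

def oid_index(oid):
    return oid >> GEN_BITS

def oid_generation(oid):
    return oid & ((1 << GEN_BITS) - 1)

def reuse_entry(generation):
    """Reassigning an entry bumps its generation, so a stale oid held by a
    client can never be confused with the new object in the same slot."""
    return generation + 1
```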
To virtualize storage capacity, the abstract state records the total number of bytes in abstract objects and a maximum capacity in bytes for the replicated file system. The abstract operations compare these values to decide when to raise NFSERR_NOSPC. The abstract state also records the maximum file size and name size to process NFSERR_FBIG and NFSERR_NAMETOOLONG deterministically. These maximum sizes and the disk capacity must be such that no correct concrete implementation raises the errors if they are not exceeded in the abstract state. The abstraction hides many details: the allocation of file blocks, the representation of large files and directories, and the persistent storage medium and how it is accessed. This is desirable for simplicity and performance. Additionally, abstracting from implementation details like resource allocation improves resilience to software faults due to aging, because proactive recovery can fix resource leaks.

3.1.2 Conformance Wrapper. There is a conformance wrapper around each implementation to ensure that it behaves according to the common specification. The conformance wrapper for the file service processes NFS protocol operations and interacts with an off-the-shelf file system implementation using the NFS protocol, as illustrated in Figure 3. A file system exported by the replicated file service is mounted on the client machine like any regular NFS file system. Application processes run unmodified and interact with the mounted file system through the NFS client in the kernel. We rely on user-level relay processes to mediate communication between the standard NFS client and the replicas. A relay receives NFS protocol requests, calls the invoke procedure of our replication library, and sends the result back to the NFS client.
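The deterministic error processing described earlier can be sketched as follows. The class and method names are ours; the error codes are the standard NFSv2 values. The point of the sketch is that the error decision reads only abstract, replica-agreed quantities, never the local disk.

```python
NFS_OK, NFSERR_FBIG, NFSERR_NOSPC, NFSERR_NAMETOOLONG = 0, 27, 28, 63

class AbstractEnv:
    """Abstract capacity and size limits shared by all replicas."""
    def __init__(self, capacity, max_file_size, max_name_len):
        self.capacity = capacity          # virtual disk capacity in bytes
        self.max_file_size = max_file_size
        self.max_name_len = max_name_len
        self.used = 0                     # total bytes in abstract objects

    def check_write(self, new_file_size, extra_bytes):
        if self.used + extra_bytes > self.capacity:
            return NFSERR_NOSPC           # deterministic: no local disk probe
        if new_file_size > self.max_file_size:
            return NFSERR_FBIG
        self.used += extra_bytes
        return NFS_OK

    def check_name(self, name):
        return NFSERR_NAMETOOLONG if len(name) > self.max_name_len else NFS_OK
```

Because every replica evaluates the same checks against the same abstract counters, all correct replicas return the same error for the same request, even if their physical disks differ in free space.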
The replication library invokes the execute procedure implemented by the conformance wrapper to run each NFS request. This architecture is similar to BFS [Castro and Liskov 1999]. The conformance representation consists of an array that corresponds to the one in the abstract state, but it does not store copies of the objects; instead, each array entry contains the type of the object and the generation number, and for non-empty entries it also contains the file handle assigned to the object by the underlying NFS server, the values of the timestamps in the object's abstract meta-data, and the index of the parent directory. The abstract objects' data and remaining meta-data attributes are computed from the concrete state when necessary. The representation also contains a map from file handles to oids to aid in processing replies efficiently.

The wrapper processes each NFS request received from a client as follows. It translates the file handles in the request, which encode oids, into the corresponding NFS server file handles. Then it sends the modified request to the underlying NFS server. The server processes the request and returns a reply. The wrapper parses the reply and updates the conformance representation. If the operation created a new object, the wrapper allocates a new entry in the array in the conformance representation, increments the generation number, and updates the entry to contain the file handle assigned to the object by the NFS server and the index of the parent directory. If any object is deleted, the wrapper marks its entry in the array as free. In both cases, the reverse map from file handles to oids is updated. The wrapper must also update the abstract timestamps in the array entries corresponding to objects that were accessed. For this, it uses the value for the current clock chosen by the primary using the propose_value upcall, in order to prevent the states of the replicas from diverging.
However, if a faulty primary chooses an incorrect value, the system could behave incorrectly. For example, the primary might always propose the same value for the current time. This would cause all replicas to update the modification time to the same value that it previously held; according to the cache consistency protocol implemented by most NFS clients [Callaghan 1999], clients would then erroneously fail to invalidate their cached data, leading to inconsistent values in the caches of different clients. The solution to this problem is to have each replica validate the choice for the current timestamp using the check_value function. This function checks that the proposed timestamp is within a specified delta of the replica's own clock value, and that the timestamps produced by the primary are monotonically increasing. This always guarantees safety: all replicas agree on the timestamp of each operation, the timestamp is close to the clock reading of at least one correct replica, and timestamps are monotonically increasing. But we rely on loosely synchronized clocks for liveness, which is reasonable if replicas use an algorithm like NTP [Mills 1992].

Fig. 4. Example of the abstraction function.

Finally, the wrapper returns a modified reply to the client, using the map to translate file handles to oids and replacing the concrete timestamp values by the abstract ones. When handling readdir calls, the wrapper reads the entire directory and sorts it lexicographically to ensure that the client receives identical replies from all replicas. In the current implementation, the conformance wrapper issues read-write requests to the service one at a time to ensure that they are serialized in order of increasing sequence number at all the replicas.
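The timestamp validation performed by check_value can be sketched as follows; the class name, the clock callback, and the concrete delta are our illustration of the two conditions (closeness to the local clock and monotonicity).

```python
class TimestampChecker:
    """Validate the primary's proposed timestamps: each must be close to
    this replica's own clock and strictly greater than the last accepted."""
    def __init__(self, clock, delta):
        self.clock = clock        # function returning this replica's time
        self.delta = delta        # maximum tolerated deviation
        self.last = float("-inf") # last accepted timestamp

    def check_value(self, proposed):
        ok = (abs(proposed - self.clock()) <= self.delta
              and proposed > self.last)
        if ok:
            self.last = proposed  # accepted timestamps remain monotonic
        return ok
```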
Read-only requests are processed using BFT's read-only optimization [Castro and Liskov 2002] and may be processed concurrently. We could improve performance by implementing a simple form of concurrency control in the wrapper and allowing non-conflicting read-write requests to execute concurrently. This wrapper is simple and small, which is important because it reduces the likelihood of introducing additional software errors, and its implementation can be reused for all NFS server implementations.

3.1.3 State Conversions. The abstraction function in the file service is implemented as follows. For each file system object, it uses the file handle stored in the conformance representation to invoke the NFS server to obtain the data and meta-data for the object. Then it replaces the concrete timestamp values by the abstract ones, converts the file handles in directory entries to oids, and sorts the directories lexicographically. Figure 4 shows how the concrete state and the conformance representation are combined to form the abstract state for a particular example. Note that the attributes in the concrete state are combined with the timestamps in the conformance representation to form the attributes in the abstract state. Also note that the contents of the files and directories are not stored by the conformance representation, but only in the concrete state.

Fig. 5. Inverse abstraction function.

The pseudocode for the inverse abstraction function in the file service is shown in Figure 5. This function receives an array with the indices of the objects that need to be updated and the new values for those objects. It scans each entry in the array to determine the type of the new object and acts accordingly.
If the new object is a file or a symbolic link, the function starts by calling update_directory, passing the new object's parent directory index as an argument. This causes the object's parent directory to be reconstructed if needed, and the corresponding object in the underlying file system to be created if it did not already exist. Then it can update the object's entry in the conformance representation, and issue a setattr and a write to update the file's meta-data and data in the concrete state. For symbolic links, it is sufficient to update their meta-data. When the new object is a directory, it is sufficient to invoke update_directory passing its own index as an argument, and then to update the appropriate entry in the conformance representation. Finally, if the new object is a free entry, the function updates the conformance representation to reflect the new object's type and generation number. If the entry was not previously free, it must also remove the mapping from the file handle that was stored in that entry to its oid. We do not have to update the parent directory of the old object, since it must have changed and will eventually be processed.

The update_directory function can be summarized as follows. If the directory that is being updated has already been updated, or is not in the array of objects that need to be updated, then the function performs no action. Otherwise, it calls itself recursively, passing the index of the parent directory (taken from the new object) as an argument. Then, it looks up the contents of the directory by issuing a readdir call. It scans the entries in the old state to remove the ones that are no longer present in the abstract state (or have a different type), and finally scans the entries in the new abstract state and creates the ones that are not present in the old state.
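The recursion in update_directory can be condensed into a short runnable sketch. This is our simplification of the Figure 5 pseudocode: the readdir-based reconciliation of entries is abstracted into a single apply callback, and the data layout is illustrative.

```python
def update_directory(idx, new_objs, updated, apply):
    """Reconstruct directory idx: first ensure its parent chain has been
    reconstructed, then reconcile its own entries via apply
    (which deletes stale entries and creates missing ones)."""
    if idx in updated or idx not in new_objs:
        return                      # already done, or not being updated
    updated.add(idx)
    # recurse on the parent (taken from the new object) before this directory
    update_directory(new_objs[idx]["parent"], new_objs, updated, apply)
    apply(idx, new_objs[idx]["entries"])
```

Marking the directory as updated before recursing is what terminates the recursion at the root, whose parent index is itself.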
Whenever update_directory creates or deletes an entry, the conformance representation is updated to reflect the change.

3.1.4 Proactive Recovery. After a recovery, a replica must be able to restore its abstract state. This could be done by saving the entire abstract state to disk before the recovery, but that would be very expensive. Instead, we want to save only the metadata (e.g., the oids and the timestamps). But to do this we need a way of relating the oids to the files in the concrete file system state. This cannot be done using file handles, since they can change when the NFS server restarts. However, the NFS specification states that each object is uniquely identified by a pair of meta-data attributes: <fsid,fileid>. We solve the problem by adding another component to the conformance representation: a map from <fsid,fileid> pairs to the corresponding oids. The shutdown method saves this map (as well as the metadata maintained by the conformance representation for each file) to disk.

After rebooting, the restart method performs the following steps. It reads the map from disk; performs a new mount RPC call, thus obtaining the file handle for the file system root; and places null file handles in the entries in the conformance representation that correspond to all the other objects, indicating that we do not yet know the new file handles for those objects. It then initializes these entries using the metadata that was stored by shutdown. Then the replication library runs the protocol to bring the abstract state of the replica up to date. As part of this process, it updates the digests in its partition tree using information collected from the other replicas and calls get_obj on each object to check whether it has the correct digest. This checks the integrity not only of file and directory contents but also of all their meta-data. Corrupt or out-of-date objects are fetched from the other replicas. The call to get_obj determines the new NFS file handle if necessary.
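The shutdown/restart handling of the recovery map can be sketched as follows; the file format and function names are ours. The key idea is that <fsid,fileid> pairs survive an NFS server restart while file handles may not.

```python
import json

def shutdown_save(path, id_to_oid):
    """shutdown: persist the <fsid,fileid> -> oid map to disk."""
    with open(path, "w") as f:
        json.dump([[fsid, fileid, oid]
                   for (fsid, fileid), oid in id_to_oid.items()], f)

def restart_load(path):
    """restart: reload the map; the new file handles themselves are
    rediscovered later, by walking the directory tree."""
    with open(path) as f:
        return {(fsid, fileid): oid for fsid, fileid, oid in json.load(f)}
```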
To determine a new file handle, the replica goes up the directory tree (using the parent index in the conformance representation) until it finds a directory whose new file handle is already known. Then it issues a readdir to learn the names and fileids of the entries in that directory, followed by a lookup call for each of those entries to obtain their NFS file handles; these handles are then stored in the array positions that are determined by the <fsid,fileid> to oid map. Then it continues down the path of the object whose file handle is being reconstructed, computing not only the file handles of the directories in that path, but also those of all their siblings in the tree. When walking up the directory tree using the parent indices, we need to detect loops so that the recovery function does not enter an infinite loop due to erroneous information stored by the replica during shutdown.

Currently, we restart the NFS server in the same file system and update its state with the objects fetched from other replicas. We could change the implementation to start an NFS server on a second empty disk and bring it up to date incrementally as we obtain the values of the abstract objects. This has the advantage of improving fault tolerance, as discussed in Section 2. Additionally, it can improve disk locality by clustering blocks from the same file and files that are in the same directory.

3.2 Object-Oriented Database

We have also applied our methodology to replicate the servers in the Thor object-oriented database [Liskov et al. 1999]. In this example, all the replicas run the same server implementation. The example is interesting because the service is more complex than NFS, and the server implementation is multithreaded and exhibits a significant degree of nondeterminism.
The methodology enabled reuse of the existing server code and could enable software rejuvenation through proactive recovery. We begin by giving a brief overview of Thor and then describe how the methodology was applied in this example. A more detailed description can be found in Rodrigues [2001].

3.2.1 System Overview. Thor [Liskov et al. 1999] provides a persistent object store that can be shared by applications running concurrently at different locations. It guarantees type-safe sharing by ensuring that all objects are used in accordance with their types. Additionally, it provides atomic transactions [Gray and Reuter 1993] to guarantee strong consistency in the presence of concurrent accesses and crashes. Thor is implemented as a client/server system in which servers provide persistent storage for objects and applications run at clients on cached copies of persistent objects. Servers store objects in pages on disk and also cache these pages in main memory. Each object stored by a server is identified by a 32-bit oref, which contains a pagenum that identifies the page where the object is stored and an onum that identifies the object within the page. Objects are uniquely identified by a pair containing the object's oref and the identifier of the server that stores the object.

Each client maintains a cache of objects retrieved from servers in main memory [Castro et al. 1997]. Applications running at the client invoke methods on these cached objects. When an application requests an object that is not cached, the client fetches the page that contains the object from the corresponding server. Thor uses an optimistic concurrency control algorithm [Adya et al. 1995] to serialize [Gray and Reuter 1993] transactions. Clients run transactions on cached copies of objects, assuming that these copies are up-to-date, but record the orefs of objects read or modified by the transaction. To commit a transaction,
the client sends a commit request to the server that stores these objects. (Thor uses a two-phase commit protocol [Gray and Reuter 1993] when transactions access objects at multiple servers, but we will not describe this case, to simplify the presentation.) The commit request includes a transaction timestamp assigned by the client, the orefs it recorded, and the new values of modified objects. The server attempts to serialize transactions in order of increasing timestamps.

To determine whether a transaction can commit, the server uses a validation queue (VQ) and invalid sets (ISs). The VQ contains an entry for every transaction that committed recently. Each entry contains the orefs of objects that were read or modified by the transaction, and the transaction's timestamp. There is an IS for each active client that lists the orefs of objects with stale copies in that client's cache. A transaction can commit if none of the objects it accessed is in the corresponding IS, if it did not modify an object that was read by a committed transaction in the VQ with a later timestamp, and if it did not read an object that was modified by a transaction in the VQ with a later timestamp. If the transaction commits, its effects are recorded persistently; otherwise, it has no effect. In either case, the server informs the client of its decision.

The server updates the ISs of clients when a transaction commits. It adds the orefs of objects modified by the transaction to the ISs of clients that are caching those objects. It computes this set of clients using a cached-pages directory that maps each page in the server to the set of clients that may have cached copies of that page. The server adds clients to the directory when they fetch pages, and clients piggyback information about pages that they have discarded in the fetch and commit requests that they send to the server.
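The three validation conditions can be stated compactly; the sketch below is our rendering of them, using Python sets for read and write sets and a simple dictionary per VQ entry.

```python
def can_commit(tx, vq, invalid_set):
    """tx: {"ts", "read", "written"} for the committing transaction;
    vq: entries for recently committed transactions;
    invalid_set: orefs with stale copies in this client's cache."""
    if (tx["read"] | tx["written"]) & invalid_set:
        return False                       # accessed an invalidated object
    for done in vq:
        if done["ts"] > tx["ts"]:          # committed with a later timestamp
            if tx["written"] & done["read"]:
                return False               # modified something it read
            if tx["read"] & done["written"]:
                return False               # read something it modified
    return True
```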
In the other direction, the servers piggyback invalidation messages on fetch and commit replies to inform clients of objects in their ISs. An object is removed from a client's IS when the server receives an acknowledgement for the invalidation. These acknowledgements are also piggybacked on other messages. When a transaction commits, clients send new versions of the modified objects but not their containing pages. These objects are stored by the server in a modified object buffer (MOB) [Ghemawat 1995] that allows the server to defer installing them to their pages on disk. The modifications are installed to disk lazily by a background flusher thread when the MOB is almost full, to make room for new modifications.

3.2.2 Abstract Specification. We applied our methodology to replicate Thor servers. The abstract specification models the behavior of these servers as seen by clients. The interface exported by servers has four main operations: start_session and end_session, which are used by clients to start and end sessions with a server; and fetch and commit, which were described before. Invalidations are piggybacked on fetch and commit replies, and invalidation acknowledgments and notifications of page evictions from client caches are piggybacked on fetch and commit requests.

The abstract state of the service is defined as follows. The array of abstract objects is partitioned into four fixed-size areas.

Database pages. Each entry in this area corresponds to a page in the database. The value of the entry with index i is equal to the value of the page with pagenum i.

Validation queue. Entries in this area correspond to entries in the VQ.
The value of each entry contains the timestamp that was assigned to the corresponding transaction (or a null timestamp for free entries), the status of the transaction, an array with the orefs of objects read by the transaction, and an array with the orefs of objects written by the transaction. When a transaction commits, it is assigned the free entry with the lowest index. When there are no free entries, the entry of the transaction with the lowest timestamp t is discarded to free an entry for the new transaction, and any transaction that later attempts to commit with a timestamp lower than t is aborted [Liskov et al. 1999]. Note that entries are not ordered by timestamp, because that could lead to inefficient checkpoint computation and state transfer: inserting entries in the middle of an ordered sequence could require shifting a large number of entries, which would increase the cost of our incremental checkpointing technique and could increase the amount of data sent during state transfers.

Invalid sets. Each entry in this area corresponds to the invalid set of an active client. The value of an entry contains the client identifier (or a null identifier for free entries), and an array with the orefs of invalid objects. When a new client invokes start_session, it is assigned an abstract client number that corresponds to the index of its entry in this area. The entry is discarded when the client invokes end_session.

Cached-pages directory. There is one entry in this area per database page. The index of an entry is equal to the pagenum of the corresponding page minus the starting index for the area. The value of an entry is an array with the abstract numbers of the clients that cache the page.

The abstraction hides the details of how the page cache and the MOB are managed at the servers. This allows different replicas to cache different pages, or to install objects to disk pages at different times, without having their abstract states diverge.

3.2.3 Conformance Wrapper.
Thor servers illustrate one of the problems that make applying our methodology harder: the external interface they offer is too narrow to implement state conversion functions that are both simple and efficient. For example, the interface between clients and servers does not allow reading or writing the validation queue, the invalid sets, or the cached-pages directory. We could solve this problem by shadowing this data in the wrapper, but this is not practical because it would require reimplementing the concurrency control algorithm. Instead, we implemented the state conversion functions using internal APIs. This was possible because we had access to the server source code. We used these internal APIs as black boxes; we did not add new operations or change existing operations. These internal APIs were used only to implement the state conversion functions. They were not used to define the abstract specification. This is important because we want the specification to abstract as many implementation details as possible.

We also replaced the communication library used between servers and clients by one with the same interface that calls the BASE library. This avoids the need to interpose client and server proxies, which was the technique we used in the file system example. The conformance wrapper maintains only two data structures: the VQ array and the client array, which are used in the state conversion functions as we describe next. Each entry in the VQ array corresponds to the entry with the same index in the VQ area of the abstract state, and it contains the transaction timestamp in that abstract entry. When a transaction commits, the wrapper assigns it an entry in the VQ array (as described in the abstract specification) and stores its timestamp there.
The entries in the client array are used to map abstract client numbers to the per-client data structures maintained by Thor. They are updated by the wrapper when clients start and end sessions with the server. In Thor, transaction timestamps are assigned by clients. The conformance wrapper rejects timestamps that deviate by more than a threshold from the time when the commit request is received. This is important to prevent faulty clients from committing transactions with very large timestamps, which could cause spurious aborts. The conformance wrapper uses the propose_value and check_value upcalls offered by the BASE library for replicas to agree on the time when the commit request is received. Replicas use the agreed-upon value to decide whether to reject or accept the proposed timestamp. This ensures that all correct replicas reach the same decision.

Besides maintaining these two data structures and checking timestamps, the wrapper simply invokes the operations exported by the Thor server, after calling modify to inform the BASE library of which abstract objects are about to be modified. In the current implementation, the wrapper issues requests to the server one at a time to ensure that replicas agree on the fate of conflicting transactions in concurrent commit requests. We could improve performance by allowing more concurrency. For example, we could perform concurrency control checks and insert transactions into the VQ sequentially in order of increasing sequence number, while allowing the rest of the execution to proceed concurrently.

3.2.4 State Conversions. The get_obj upcall receives the index of an abstract object and returns a pointer to a buffer containing the current value of that abstract object. The implementation of get_obj in this example uses the index to determine which area the abstract object belongs to. Then, it computes the value of the abstract object using the procedure that corresponds to the object's area.

Database pages.
If the abstract object is a database page, get_obj retrieves a copy of the page from disk (or from the page cache) and applies any pending modifications to the page that are in the MOB. This is the current value of the page that is returned.

Validation queue. If the object represents a validation queue entry, get_obj retrieves the timestamp that corresponds to this entry from the VQ array in the conformance representation. Then it uses the timestamp to fetch the entry from the VQ maintained by the server, and copies the sets with orefs of objects read or modified by the transaction to compose the value of the abstract object.

Invalid sets. If the object represents an invalid set for a client with number c, get_obj uses the client array in the conformance representation to map c to the client data structure maintained by the server for the corresponding client. Then, it retrieves the client invalid set from this data structure and uses it to compose the abstract object value.

Cached-pages directory. In this case, get_obj determines the pagenum of the requested abstract object by computing the offset to the beginning of the area. Then, it uses the pagenum to look up the information to compose the abstract object value in the cached-pages directory maintained by the server.

The put_objs upcall receives an array with new values for abstract objects and updates the concrete state to match these values. It iterates over the abstract object values and uses the object indices to determine which of the procedures below to execute.

Database pages. To update a concrete database page, put_objs removes any modifications in the MOB for that page to ensure that the new page value will not be overwritten with old modifications. Then, it places a page matching the new abstract value in the server's cache and marks it as dirty.
Validation queue, invalid sets, and cached-pages directory. If the relevant server data structure already contains an entry corresponding to a new abstract object value, the function just updates the entry according to the new value. Otherwise, it must delete the entry from the server data structure if the new abstract object value describes a nonexistent entry, or create the entry if it did not previously exist and fill in the values according to the new abstract value. The conformance representation is updated accordingly.

4. EVALUATION

Our replication technique must achieve two goals to be successful: it must have low overhead, and the code of the conformance wrapper and the state conversion functions must be simple. It is important for the code to be simple to reduce the likelihood of introducing more errors and to keep the monetary cost of using our technique low. This section evaluates the extent to which both example applications meet each of these goals.

A detailed performance evaluation of the BFT library appears in Castro and Liskov [2000, 2002] and Castro [2000]. It includes several micro-benchmarks to evaluate performance in the normal case, during view changes, and during state transfer. Additionally, it includes experiments to evaluate the performance impact of the various optimizations used by BFT. Furthermore, these experimental results are backed by an analytic performance model [Castro 2000]. These results are relevant to the evaluation of the BASE library. The difference is that BASE adds the cost of conversions between abstract and concrete representations.

4.1 File System Overhead

This section presents results of experiments that compare the performance of our replicated file system with the off-the-shelf, unreplicated NFS implementations that it wraps.

4.1.1 Experimental Setup.
Our technique has three advantages: reuse of existing code, software rejuvenation through proactive recovery, and opportunistic N-version programming. We present results of experiments to measure the overhead in systems that benefit from different combinations of these advantages. We ran experiments with and without proactive recovery in a homogeneous setup, where all replicas ran the same operating system, and in a heterogeneous setup, where each replica ran a different operating system.

All experiments ran with four replicas and one client. Four replicas can tolerate one Byzantine fault; we expect this reliability level to suffice for most applications. Experiments to evaluate the performance of the replication algorithm with more clients and replicas appear in Castro [2000].

In the homogeneous setup, clients and replicas ran on Dell Precision 410 workstations with Linux 2.2.16-3 (uniprocessor). These workstations have a 600 MHz Pentium III processor, 512 MB of memory, and a Quantum Atlas 10K 18WLS disk. All machines were connected by a 100 Mb/s switched Ethernet and had 3Com 3C905B interface cards. The switch was an Extreme Networks Summit48 V4.1. The experiments ran on an isolated network.

The heterogeneous setup used the same hardware, but some replicas ran different operating systems. The client and one of the replicas ran Linux as in the homogeneous setup. The other replicas ran different operating systems: one ran Solaris 8 1/01; another ran OpenBSD 2.8; and the last one ran FreeBSD 4.0.

All experiments ran the modified Andrew benchmark [Howard et al. 1988; Ousterhout 1990], which emulates a software development workload. It has five phases: (1) creates subdirectories recursively; (2) copies a source tree; (3) examines the status of all the files in the tree without examining their data; (4) examines every byte of data in all the files; and (5) compiles and links the files.
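The replica count used in these experiments follows the standard arithmetic for Byzantine fault tolerance: n = 3f + 1 replicas are needed to tolerate f faulty replicas, and an operation completes once a quorum of 2f + 1 replicas agrees. A minimal sketch of this arithmetic:

```python
# Standard Byzantine fault tolerance arithmetic: n = 3f + 1 replicas
# tolerate f faults, and quorums contain 2f + 1 replicas.

def replicas_needed(f):
    """Minimum number of replicas to tolerate f Byzantine faults."""
    return 3 * f + 1

def quorum_size(f):
    """Number of matching replies needed to complete an operation."""
    return 2 * f + 1

# Our experiments use f = 1: four replicas, quorums of three.
```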
The experiments ran the scaled-up version of the benchmark described in Castro and Liskov [2000], where phases 1 and 2 create n copies of the source tree and the other phases operate in all these copies. We ran a version of Andrew with n equal to 100, Andrew100, that creates approximately 200 MB of data, and another with n equal to 500, Andrew500, that creates approximately 1 GB of data. Andrew100 fits in memory at both the client and the replicas but Andrew500 does not.

The benchmark ran at the client machine using the standard NFS client implementation in the Linux kernel with the following mount options: UDP transport, 4096-byte read and write buffers, allowing write-back client caching, and allowing attribute caching. All the experiments report the average of three runs of the benchmark, and the standard deviation was always below 7% of the reported values.

Table I. Andrew100: Elapsed Time in Seconds

  Phase       1      2      3      4       5    Total
  BASEFS    0.9   49.2   45.4   44.7   287.3   427.65
  NFS-std   0.5   27.4   39.2   36.5   234.7   338.3

Table II. Andrew500: Elapsed Time in Seconds

  Phase       1       2       3       4        5     Total
  BASEFS    5.0   248.2   231.5   298.5   1545.5   2328.7
  NFS-std   2.4   137.6   199.2   238.1   1247.1   1824.4

4.1.2 Homogeneous Results. Tables I and II present the results for Andrew100 and Andrew500 in the homogeneous setup with no proactive recovery. They compare the performance of our replicated file system, BASEFS, with the standard, unreplicated NFS implementation in Linux with Ext2fs at the server, NFS-std. In these experiments, BASEFS is also implemented on top of a Linux NFS server with Ext2fs at each replica. The results show that the overhead introduced by our replication technique is low: BASEFS takes only 26% longer than NFS-std to run Andrew100 and 28% longer to run Andrew500.
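These percentages follow directly from the totals in Tables I and II; as a quick sanity check (an illustrative computation, with the numbers copied from the tables):

```python
# Relative slowdown of BASEFS over unreplicated NFS-std, computed from
# the benchmark totals in Tables I and II (seconds).

def overhead_percent(replicated, baseline):
    """Slowdown of the replicated system relative to the baseline."""
    return 100.0 * (replicated - baseline) / baseline

andrew100 = overhead_percent(427.65, 338.3)   # about 26%
andrew500 = overhead_percent(2328.7, 1824.4)  # about 28%
```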
The overhead is different for the different phases mostly due to variations in the amount of time the client spends computing between issuing NFS requests. There are two main sources of overhead: the cost of running the Byzantine-fault-tolerant replication protocol, and the cost of abstraction. The latter includes the time spent running the conformance wrapper and the time spent running the abstraction function to compute checkpoints of the abstract file system state. We estimate that the Byzantine fault tolerance protocol adds approximately 15% overhead relative to NFS-std in Andrew100 and 20% in Andrew500. This estimate is based on the overhead of BFS relative to NO-REP for Andrew100 and Andrew500 that was reported in Castro and Liskov [2000]. We expect this estimate to be fairly accurate: BFS is very similar to BASEFS except that it does not use abstraction, and NO-REP is identical to BFS except that it is not replicated. The remaining overhead of 11% relative to NFS-std in Andrew100 and 8% in Andrew500 can be attributed to abstraction.

We also ran Andrew100 and Andrew500 with proactive recovery. The results, labeled BASEFS-PR, are shown in Table III. The results for Andrew100 were obtained by recovering replicas round robin with a new recovery starting every 80 seconds; reboots were simulated by sleeping 30 seconds.¹ We obtained the results for Andrew500 in the same way, but in this case a new recovery was started every 250 seconds.

Table III. Andrew with Proactive Recovery: Elapsed Time to Run the Benchmark in Seconds

  System      BASEFS-PR   BASEFS   NFS-std
  Andrew100       448.2   427.65    338.33
  Andrew500      2385.1   2328.7   1824.4

Table IV. Andrew: Maximum Time to Complete a Recovery in Seconds

              Shutdown   Reboot   Restart   Fetch and check    Total
  Andrew100       0.07    30.05      0.18             18.28    48.58
  Andrew500       0.32    30.05      0.97            141.37   172.71
This leads to a window of vulnerability of approximately 6 minutes for Andrew100 and 17 minutes for Andrew500; that is, the system will work correctly as long as fewer than 1/3 of the replicas fail in a correlated way within any time window of size 6 (or 17) minutes. (A discussion of windows of vulnerability with proactive recovery appears in Castro and Liskov [2002].) The results show that even with these very strong guarantees BASEFS is only 32% slower than NFS-std in Andrew100 and 31% slower in Andrew500.

Table IV presents a breakdown of the time to complete the slowest recovery in Andrew100 and Andrew500. Shutdown accounts for the time to write the state of the replication library and the conformance representation to disk, and restart is the time to read this information back. Fetch and check is the time to rebuild the oid-to-file-handle mappings in the conformance wrapper, to convert the state stored by the NFS server to its abstract form and check it, and to fetch out-of-date objects from other replicas. Fetching out-of-date objects is done in parallel with converting and checking the state. The recovery time in Andrew100 is dominated by the time to reboot, but as the state size increases, reading, converting, and checking the state become the dominant cost; this accounts for 141 seconds in Andrew500 (82% of the total recovery time). Scaling to larger states is an issue, but we could use the techniques suggested in Castro and Liskov [2000] that make the cost of checking proportional to the number of objects modified in a time period rather than to the total number of objects in the state.

As mentioned, we would like our implementation of proactive recovery to start an NFS server on a second empty disk with a clean file system to improve the range of faults that can be tolerated.
We believe that extending our implementation in this way should not significantly affect the performance of the recovery. We would write each abstract object to the new file system asynchronously right after checking it. Since the value of the abstract object is already in memory at this point and it is written to a different disk, the additional overhead should be minimal.

¹ This reboot time is based on the results obtained by the LinuxBIOS project [Minnich 2000]. They claim to be able to reboot Linux in 35 s by replacing the BIOS with Linux.

4.1.3 Heterogeneous Results. Table V presents results for Andrew100 with and without proactive recovery in the heterogeneous setup. In this experiment, each BASEFS replica runs a different operating system with a different NFS and file system implementation. The table also presents results for the standard NFS implementation in each operating system without replication.

Table V. Andrew100 Heterogeneous: Elapsed Time in Seconds

  System         BASEFS-PR   BASEFS   OpenBSD   Solaris   FreeBSD   Linux
  Elapsed Time      1950.6   1662.2    1599.1    1009.2     848.4   338.3

The overhead of BASEFS in this experiment varies from 4% relative to the slowest replica (OpenBSD) to 391% relative to the fastest replica (Linux). The replica running Linux is much faster than all the others because Linux does not ensure stability of modified data and meta-data before replying to the client as required by the NFS protocol. The overhead relative to OpenBSD is low because BASEFS only requires a quorum with 3 replicas, which must include the primary, to complete operations. These results were obtained with the primary replica in the machine running Linux. Therefore, BASEFS does not need to wait for the slowest replica to complete operations.
However, this replica slows down the others because it is out-of-date most of the time and is constantly transferring state from the others. This partially explains why the overhead relative to the third fastest replica (Solaris) is higher than in the homogeneous case (65% versus 26%).

We also ran BASEFS with proactive recoveries in the heterogeneous setup. We recovered a new replica every 425 seconds; reboots were simulated by sleeping 30 seconds. In this case, the overhead varies from 22% relative to the slowest replica to 477% relative to the fastest replica. The overhead of BASEFS-PR relative to BASEFS without proactive recovery is higher in the heterogeneous setup than in the homogeneous setup. This happens because proactive recovery causes the slowest replica to periodically become the primary. During these periods the system must wait for the slowest replica to complete operations (and to get up to date before it can complete operations).

4.2 Object-Oriented Database Overhead

This section presents results of experiments to measure the overhead of our replicated implementation of Thor relative to the original implementation of Thor without replication. To provide a conservative measurement, the version of Thor without replication does not ensure stability of information committed by a transaction. A real implementation would save a transaction log to disk or use replication to ensure stability as we do. In either case, the overhead introduced by BASE would be lower.

Fig. 6. OO7: Elapsed time, cold read-only traversals.

4.2.1 Experimental Setup. We ran four replicas and one client in the homogeneous setup described in Section 4.1. The experiments ran the OO7 benchmark [Carey et al. 1993], which is intended to match the characteristics of many different CAD/CAM/CASE applications.
The OO7 database contains a tree of assembly objects, with leaves pointing to three composite parts chosen randomly from among 500 such objects. Each composite part contains a graph of atomic parts linked by connection objects; each atomic part has 3 outgoing connections. All our experiments ran on the medium database, which has 200 atomic parts per composite part.

The OO7 benchmark defines several database traversals; these perform a depth-first traversal of the assembly tree and execute an operation on the composite parts referenced by the leaves of this tree. Traversals T1 and T6 are read-only; T1 performs a depth-first traversal of the entire composite part graph, while T6 reads only its root atomic part. Traversals T2a and T2b are identical to T1 except that T2a modifies the root atomic part of the graph, while T2b modifies all the atomic parts. We ran each traversal in a single transaction.

The objects are clustered into 4 KB pages in the database. The database takes up 38 MB in our implementation. Each server replica had a 20 MB cache (of which 16 MB were used for the MOB); the client cache had 16 MB. All the results we report are for cold traversals: the client and server caches were empty at the beginning of the traversals.

4.2.2 OO7 Results. The results in Figure 6 are for read-only traversals. We measured elapsed times for T1 and T6 traversals of the database, both in the original implementation, Thor, and the version that is replicated with BASE, BASE-Thor. The figure shows the total time to run the transaction broken into the time to run the traversal and the time to commit the transaction.

Fig. 7. Elapsed time, cold read-write traversals.

BASE-Thor takes 39% more time to complete T1, and 29% more time to complete T6. The commit cost is a small fraction of the total time in these experiments.
Therefore, most of the overhead is due to an increase in the cost to fetch pages. The micro-benchmarks in Castro and Liskov [2000] predict an overhead of 60% when fetching 4 KB pages with no computation at the client or the replicas. The overhead here is lower because the pages have to be read from the replicas' disks. Similarly, the relative overhead is lower for traversal T6 because it generates disk accesses with less locality. Thus, the average time to read a disk page from the server disk is higher in T6 than in T1. We expect a similar effect in more realistic settings where the database does not fit in main memory at either the server or the clients. In these settings, BASE will have lower overhead because the cost of disk accesses will dominate performance.

Figure 7 shows elapsed times for read-write traversals. In this case, BASE adds an overhead relative to the original implementation of 38% in T2a and 45% in T2b. The traversal times for T1, T2a, and T2b are almost identical because these traversals are very similar. What is different is the time to commit the transactions. Traversal T2a modifies 500 atomic parts whereas T2b modifies 100,000. Therefore, the commit time is a significant fraction of the total time in traversal T2b but not in traversal T2a. BASE increases the commit overhead significantly due to the cost of maintaining checkpoints. The overhead for read-write traversals would be significantly lower relative to a version of Thor that ensured stability of transaction logs.

4.3 Code Complexity

To implement the conformance wrapper and the state conversion functions, it is necessary to write new code. It is important for this code to be simple so that it is easy to write and not likely to introduce new bugs. We measured the number of semicolons in the code we wrote for the replicated file system and for the replicated database to evaluate its complexity.
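For illustration, a naive version of this metric can be computed in a few lines. This sketch is not the tool used for the measurements; in particular, it would also count semicolons that appear inside comments or string literals.

```python
# Naive complexity metric: count semicolons in C/C++ source text.
# Unlike a real tokenizer, this also counts semicolons that appear
# inside comments or string literals.

def count_semicolons(source):
    return source.count(";")

# Blank lines and comment-only lines contribute nothing here.
sample = "int x = 0;\n/* a comment */\nx++;\n\n"
```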
Counting semicolons is better than counting lines because it does not count comment and blank lines. The code we wrote for the replicated file system has a total of 1105 semicolons, with 624 in the conformance wrapper and 481 in the state conversion functions. Of the semicolons in the state conversion functions, only 45 are specific to proactive recovery. The code of the conformance wrapper is trivial. It has 624 semicolons only because there are many NFS operations. The code of the state conversion functions is slightly more complex because it involves directory tree traversals with several special cases, but it is still rather simple. To put these numbers in perspective, the number of semicolons in the code in the Linux 2.2.16 kernel that is directly related to the file system, NFS, and the driver of our SCSI adapter is 17735. Furthermore, this represents only a small fraction of the total size of the operating system; Linux 2.2.16 has 544442 semicolons including drivers and 229095 semicolons without drivers.

The code we wrote for the replicated database has a total of 658 semicolons, with 345 in the conformance wrapper and 313 in the state conversion functions. To put these numbers in perspective, the number of semicolons in the original Thor code is 37055.

5. RELATED WORK

The BASE library is implemented as an extension to the BFT library [Castro and Liskov 1999, 2000, 2002], and BASEFS is inspired by the BFS file system [Castro and Liskov 1999]. But BFT requires all replicas to run the same service implementation and does not allow reuse of existing code without significant modifications. Our technique for software rejuvenation [Huang et al. 1995] is based on the proactive recovery technique implemented in BFT [Castro and Liskov 2000].
But the use of abstraction allows us to tolerate software errors due to aging that could not be tolerated in BFT, for example, resource leaks in the service code. Additionally, it allows us to combine proactive recovery with N-version programming. There is a lengthy discussion of work related to BFT in Castro and Liskov [2002], which includes work not only on Byzantine fault tolerance but also on replication in general. Therefore, we omit a discussion of that work from this paper and concentrate on work related to what is new in BASE relative to BFT.

N-version programming [Chen and Avizienis 1978] exploits design diversity to reduce common mode failures. It works as follows: N software development teams produce different implementations of the same service specification for the same customer; the different implementations are then run in parallel; and voting is used to produce a common result. This technique has been criticized for several reasons [Gray and Siewiorek 1991]: it increases development and maintenance costs by a factor of N or more, and it adds unacceptable time delays to the implementation. In general, this is considered to be a powerful technique, but with limited usability since only a small subset of applications can afford it.

BASE enables low-cost N-version programming by reusing existing implementations from different vendors. Since each implementation is developed for a large number of customers, there are significant economies of scale that keep the development, testing, and maintenance costs per customer low. Additionally, the cost of writing the conformance wrappers and state conversion functions is kept low by taking advantage of existing interoperability standards. The end result is that our technique will cost less and may actually be more
effective at reducing common mode failures because competitive pressures will keep implementations from different vendors independent.

Recovery of faulty versions has been addressed in the context of N-version programming [Romanovsky 2000; Tso and Avizienis 1987], but these approaches have suffered from two problems. First, they are inefficient and cannot scale to services with large state. Second, they require detailed knowledge of each version, which precludes our opportunistic N-version programming technique. For example, Romanovsky [2000] proposes a technique where each version defines a conversion function from its concrete state to an abstract state. But this abstract state is based on what is common across the implementations of the different versions. Our technique improves on this by providing a very efficient recovery mechanism and by deriving the abstract state from an abstract behavioral specification that succinctly captures what is visible to the client; this leads to better fault tolerance and efficiency.

A different way to achieve better fault tolerance through diversity appears in Ramkumar and Strumpen [1997]. This work provides a compiler-assisted mechanism for portable checkpointing that enables recovery of a service in a different machine with a different processor architecture or even a different operating system. This technique uses a single implementation but exploits diversity in an environment using checkpointing/recovery techniques.

Several other systems have used wrapping techniques to replicate existing components [Cooper 1985; Liskov et al. 1991; Bressoud and Schneider 1995; Maffeis 1995; Moser et al. 1998; Narasimhan et al. 1999]. Many of these systems have relied on standards like NFS or CORBA [Object Management Group 1999] to simplify wrapping of existing implementations. For example, Eternal [Moser et al.
1998] is a commercial implementation of the new Fault-Tolerant CORBA standard [Object Management Group 2000]. All of these systems except Immune [Narasimhan et al. 1999] assume benign faults. There are significant differences between these systems and BASE. First, they assume that replicas run identical implementations. They also assume that replicas are deterministic, or they resolve the nondeterminism at a low level of abstraction. For example, many resolve nondeterminism by having a primary run the operations and ship the resulting state or a log with all nondeterministic events to the backups (e.g., Bressoud and Schneider [1995]). Not only does this fail to work with Byzantine faults, but replicas are more likely to fail at the same time because they are forced to behave identically at a low level of abstraction.

RNFS [Marzullo and Schmuck 1988] implements a replicated NFS file system from an existing implementation of NFS, and Postgres-R [Kemme and Alonso 2000] and the work in Amir et al. [2002] and Jiménez-Peris et al. [2002] implement replicated databases from an existing implementation of Postgres [Stonebraker et al. 1990]. They use group communication toolkits like Isis [Birman et al. 1991] and Ensemble [Hayden 1998] to coordinate wrappers, and the wrappers hide observable nondeterminism from the clients rather than forcing deterministic behavior at a low level. They differ from our work because they assume that all replicas run the same implementation, they cannot tolerate Byzantine faults, and they do not provide a proactive recovery mechanism.

BASE uses abstraction to hide most nondeterminism and to enable replicas to run different implementations. It also offers an efficient mechanism for replicas to agree on nondeterministic choices that works with Byzantine faults. This is important when these choices are directly visible to clients, for example, timestamps.
Additionally, we provide application-independent support for efficient state transfer and for incremental conversion between abstract and concrete state, which is important because these are harder with Byzantine faults.

The work described in Salles et al. [1999] uses wrappers to ensure that an implementation satisfies an abstract specification. These wrappers use the specification to check the correctness of outputs generated by the implementation and to contain faults. They are not used to enable replication with different or nondeterministic implementations as in BASE.

Since both the examples described in this article relate to storage, it is worthwhile comparing them with RAID [Chen et al. 1994], which is the most widely deployed replicated storage solution. RAID is implemented in hardware on many motherboards and it is cheap. However, BASEFS and BASE-Thor replicate not only the disk but the entire service implementation and operating system. They are more expensive, but they may be able to tolerate faults that RAID cannot. For example, errors in the operating system in one replica could cause file system state to become corrupt. BASEFS may tolerate this error if it does not occur in the operating systems of other replicas, but any RAID solution would simply write the corrupt information to the replicated disk.

6. CONCLUSION

Software errors are a major cause of outages, and they are increasingly exploited in malicious attacks to gain control of or deny access to important services. Byzantine fault tolerance allows replicated systems to mask some software errors, but it has been expensive to deploy. We have described a replication technique, BASE, which uses abstraction to reduce the cost of deploying Byzantine fault tolerance and to improve its ability to withstand attacks and mask software errors.
BASE reduces cost because it enables reuse of off-the-shelf service implementations, and it improves resilience to software errors by enabling opportunistic N-version programming and software rejuvenation through proactive recovery.

Opportunistic N-version programming runs distinct, off-the-shelf implementations at each replica to reduce the probability of common mode failures. To apply this technique, it is necessary to define a common abstract behavioral specification for the service and to implement appropriate conversion functions for the state, requests, and replies of each implementation in order to make it behave according to the common specification. These tasks are greatly simplified by basing the common specification on standards for the interoperability of software from different vendors; such standards are common, for example, ODBC [Geiger 1995] and NFS [RFC-1094 1989]. Opportunistic N-version programming improves on previous N-version programming techniques by avoiding the high development, testing, and maintenance costs without compromising the quality of individual versions.

Additionally, we provide a mechanism to repair faulty replicas. Proactive recovery allows the system to remain available provided no more than one-third of the replicas become faulty and corrupt the abstract state (in a correlated way) within a window of vulnerability. Abstraction may enable service availability even when more than one-third of the replicas are faulty because it can hide corrupt items in the concrete states of faulty replicas.

The paper described BASEFS, a replicated NFS file system implemented using our technique. The conformance wrapper and the state conversion functions in our prototype are simple, which suggests that they are unlikely to introduce new bugs and that the monetary cost of using our technique would be low.
We ran the Andrew benchmark to compare the performance of our replicated file system and the off-the-shelf implementations that it reuses. Our performance results indicate that the overhead introduced by our technique is low; BASEFS performs within 32% of the standard NFS implementations that it reuses. We also used the methodology to build a Byzantine-fault-tolerant version of the Thor object-oriented database [Liskov et al. 1999] and made similar observations. In this case, the methodology enabled reuse of the existing database code, which is nondeterministic.

As future work, it would be interesting to apply the BASE technique to a relational database service by taking advantage of the ODBC standard. Additionally, a library of mappings between abstract and concrete states for common data structures would further simplify our technique.

ACKNOWLEDGMENTS

We would like to thank Chandrasekhar Boyapati, João Garcia, Ant Rowstron, and Larry Peterson for their helpful comments on drafts of this paper. We also thank Charles Blake, Benjie Chen, Dorothy Curtis, Frank Dabek, Michael Ernst, Kevin Fu, Frans Kaashoek, David Mazières, and Robert Morris for help in providing an infrastructure to run some of the experiments.

REFERENCES

ADYA, A., GRUBER, R., LISKOV, B., AND MAHESHWARI, U. 1995. Efficient Optimistic Concurrency Control using Loosely Synchronized Clocks. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data. San Jose, California, 23–34.
AMIR, Y., DANILOV, C., MISKIN-AMIR, M., STANTON, J., AND TUTU, C. 2002. Practical Wide-Area Database Replication. Tech. Rep. CNDS-2002-1, Johns Hopkins University.
BIRMAN, K., SCHIPER, A., AND STEPHENSON, P. 1991. Lightweight causal and atomic group multicast. ACM Trans. Comput. Syst. 9, 3.
BRESSOUD, T. AND SCHNEIDER, F. 1995. Hypervisor-based Fault Tolerance. In Proceedings of the Fifteenth ACM Symposium on Operating System Principles. Copper Mountain Resort, Colorado, 1–11.
CALLAGHAN, B. 1999.
NFS Illustrated. Addison-Wesley, Reading, Massachusetts. CAREY, M. J., DEWITT, D. J., AND NAUGHTON, J. F. 1993. The OO7 Benchmark. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data. Washington D.C., 12– 21. ACM Transactions on Computer Systems, Vol. 21, No. 3, August 2003. 268 • M. Castro et al. CASTRO, M. 2000. Practical Byzantine fault-tolerance. Ph.D. thesis, Massachusetts Institute of Technology, Cambridge, Massachusetts. CASTRO, M., ADYA, A., LISKOV, B., AND MYERS, A. 1997. HAC: Hybrid Adaptive Caching for Distributed Storage Systems. In Proceedings of the Sixteenth ACM Symposium on Operating System Principles. Saint Malo, France, 102–115. CASTRO, M. AND LISKOV, B. 1999. Practical Byzantine fault tolerance. In Proceedings of the Third Symposium on Operating Systems Design and Implementation. New Orleans, Louisiana, 173– 186. CASTRO, M. AND LISKOV, B. 2000. Proactive recovery in a Byzantine-fault-tolerant system. In Proceedings of the Fourth Symposium on Operating Systems Design and Implementation. San Diego, California, 273–288. CASTRO, M. AND LISKOV, B. 2002. Practical Byzantine Fault Tolerance and Proactive Recovery. ACM Trans. Comput. Syst. 20, 4 (Nov.), 398–461. CHEN, L. AND AVIZIENIS, A. 1978. N-Version Programming: A Fault-Tolerance Approach to Reliability of Software Operation. In Fault Tolerant Computing, FTCS-8. 3–9. CHEN, P. M., LEE, E. K., GIBSON, G. A., KATZ, R. H., AND PATTERSON, D. A. 1994. RAID: Highperformance, reliable secondary storage. ACM Comput. Surv. 26, 2, 145–185. COOPER, E. 1985. Replicated Distributed Programs. In Proceedings of the Tenth ACM Symposium on Operating System Principles. Orcas Island, Washington, 63–78. GEIGER, K. 1995. Inside ODBC. Microsoft Press. GHEMAWAT, S. 1995. The Modified Object Buffer: a storage management technique for object-oriented databases. Ph.D. thesis, Massachusetts Institute of Technology, Cambridge, Massachusetts. GRAY, J. AND SIEWIOREK, D. 1991. 
High-availability computer systems. IEEE Comput. 24, 9 (Sept.), 39–48. GRAY, J. N. AND REUTER, A. 1993. Transaction Processing: Concepts and Techniques. Morgan Kaufmann Publishers Inc. HAYDEN, M. 1998. The Ensemble System. Tech. Rep. TR98-1662, Cornell University, Ithaca, New York. Jan. HERLIHY, M. P. AND WING, J. M. 1987. Axioms for Concurrent Objects. In Conference Record of the Fourteenth Annual ACM Symposium on Principles of Programming Languages. Munich, Germany, 13–26. HOWARD, J., KAZAR, M., MENEES, S., NICHOLS, D., SATYANARAYANAN, M., SIDEBOTHAM, R., AND WEST, M. 1988. Scale and performance in a distributed file system. ACM Trans. Comput. Syst. 6, 1 (Feb.), 51–81. HUANG, Y., KINTALA, C., KOLETTIS, N., AND FULTON, N. D. 1995. Software rejuvenation: Analysis, modules and applications. In Digest of Papers: FTCS-25, The Twenty-Fifth International Symposium on Fault-Tolerant Computing. Pasadena, California, 381–390. JIMÉNEZ-PERIS, R., PATIÑO-MARTı́NEZ, M., KEMME, B., AND ALONSO, G. 2002. Improving the Scalability of Fault-Tolerant Database Clusters. In Proceedings of the 22nd International Conference on Distributed Computing Systems (ICDCS’02). Vienna, Austria. KEMME, B. AND ALONSO, G. 2000. Don’t be lazy be consistent: Postgres-R, a new way to implement Database Replication. In VLDB 2000, Proceedings of 26th International Conference on Very Large Data Bases. Cairo, Egypt. LAMPORT, L. 1978. Time, Clocks, and the Ordering of Events in a Distributed System. Comm. ACM 21, 7 (July), 558–565. LISKOV, B., CASTRO, M., SHRIRA, L., AND ADYA, A. 1999. Providing persistent objects in distributed systems. In Proceedings of the 13th European Conference on Object-Oriented Programming (ECOOP ’99). Lisbon, Portugal, 230–257. LISKOV, B., GHEMAWAT, S., GRUBER, R., JOHNSON, P., SHRIRA, L., AND WILLIAMS, M. 1991. Replication in the Harp File System. In Proceedings of the Thirteenth ACM Symposium on Operating System Principles. Pacific Grove, California, 226–238. LISKOV, B. 
AND GUTTAG, J. 2000. Program Development in Java: Abstraction, Specification, and Object-Oriented Design. Addison-Wesley. MAFFEIS, S. 1995. Adding group communication and fault tolerance to CORBA. In Proceedings of the Second USENIX Conference on Object-Oriented Technologies. Toronto, Canada, 135–146. ACM Transactions on Computer Systems, Vol. 21, No. 3, August 2003. BASE: Using Abstraction to Improve Fault Tolerance • 269 MARZULLO, K. AND SCHMUCK, F. 1988. Supplying high availability with a standard network file system. In Proceedings of the 8th International Conference on Distributed Computing Systems. San Jose, California, 447–453. MILLS, D. L. 1992. Network Time Protocol (Version 3) Specification, Implementation and Analysis. Network Working Report RFC 1305. MINNICH, R. 2000. The Linux BIOS Home Page. http://www.acl.lanl.gov/linuxbios. MOSER, L., MELLIAR-SMITH, P., AND NARASIMHAN, P. 1998. Consistent object replication in the eternal system. Theory and Practice of Object Systems 4, 2 (Jan.), 81–92. NARASIMHAN, P., KIHLSTROM, K., MOSER, L., AND MELLIAR-SMITH, P. 1999. Providing Support for Survivable CORBA Applications with the Immune System. In Proceedings of the 19th International Conference on Distributed Computing Systems. Austin, Texas, 507–516. OBJECT MANAGEMENT GROUP. 1999. The Common Object Request Broker: Architecture and Specification. OMG techical committee document formal/98-12-01. June. OBJECT MANAGEMENT GROUP. 2000. Fault Tolerant CORBA. OMG techical committee document orbos/2000-04-04. Mar. OUSTERHOUT, J. 1990. Why Aren’t Operating Systems Getting Faster as Fast as Hardware? In Proceedings of the Usenix Summer 1990 Technical Conference. Anaheim, California, 247–256. PEASE, M., SHOSTAK, R., AND LAMPORT, L. 1980. Reaching Agreement in the Presence of Faults. J. ACM 27, 2 (Apr.), 228–234. RAMKUMAR, B. AND STRUMPEN, V. 1997. Portable checkpointing for heterogeneous architectures. 
In Digest of Papers: FTCS-27, The Twenty-Seventh Annual International Symposium on FaultTolerant Computing. Seattle, Washington, 58–67. RFC-1014 1987. Network working group request for comments: 1014. XDR: External data representation standard. RFC-1094 1989. Network working group request for comments: 1094. NFS: Network file system protocol specification. RODRIGUES, R. 2001. Combining abstraction with Byzantine fault-tolerance. M.S. thesis, Massachusetts Institute of Technology, Cambridge, Massachusetts. ROMANOVSKY, A. 2000. Faulty version recovery in object-oriented N-version programming. IEE Proc. Soft. 147, 3 (June), 81–90. SALLES, F., RODRı́GUEZ, M., FABRE, J., AND ARLAT, J. 1999. MetaKernels and Fault Containment Wrappers. In Digest of Papers: FTCS-29, The Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing. Madison, Wisconsin, 22–29. SCHNEIDER, F. 1990. Implementing fault-tolerant services using the state machine approach: a tutorial. ACM Comput. Surv. 22, 4 (Dec.), 299–319. STONEBRAKER, M., ROWE, L. A., AND HIROHAMA, M. 1990. The implementation of POSTGRES. IEEE Trans. Knowl. Data Eng. 2, 1 (Mar.), 125–142. TSO, K. AND AVIZIENIS, A. 1987. Community error recovery in N-version software: A design study with experimentation. In Digest of Papers: FTCS-17, the Seventeenth Annual Symposium on Fault Tolerant Computing. Pittsburgh, Pennsylvania, 127–133. W3C. 2000. Extensible Markup Language (XML) 1.0 (Second Edition). W3C recommendation. Received September 2001; revised November 2002; accepted December 2002 ACM Transactions on Computer Systems, Vol. 21, No. 3, August 2003.