
UpRight Fault Tolerance

2010

Copyright by Allen Grogan Clement 2010

The Dissertation Committee for Allen Grogan Clement certifies that this is the approved version of the following dissertation:

UpRight Fault Tolerance

Committee:
Lorenzo Alvisi, Co-Supervisor
Mike Dahlin, Co-Supervisor
Peter Druschel
Michael Walfish
Emmett Witchel

UpRight Fault Tolerance

by Allen Grogan Clement, A.B.

Dissertation Presented to the Faculty of the Graduate School of The University of Texas at Austin in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

The University of Texas at Austin
December 2010

Acknowledgments

As much as this document is “mine,” it would not exist without the support, assistance, and guidance of numerous people. My advisors, Lorenzo Alvisi and Mike Dahlin, have been instrumental in the process of completing this document. Their guidance over the last several years has been invaluable and I would not be the person or researcher that I am today without them. The other members of the thesis committee (Peter Druschel, Michael Walfish, and Emmett Witchel) have exhibited great patience and understanding during the writing process. Their insights and comments on the work have been greatly appreciated and have improved the quality of this document.

Fellow graduate students make grad school both possible and bearable. Over the past 8 years I’ve had the pleasure of working closely with some great students in the LASR group: especially Amit Aiyer, Manos Kapritsos, Rama Kotla, Sangmin Lee, Harry Li, Prince Mahajan, Mirco Marchetti, J.P. Martin, Jeff Napper, Don Porter, Taylor Riche, Eric Rozner, Chris Rossbach, Srinath Setty, Yang Wang, and Ed Wong. Thank you all for your friendship, advice, help, beers, and patience—I would not have finished without you. Sara Strandtman, the LASR administrator, was very important in getting me out the door. While I may have survived graduate school without her, life was much less stressful knowing that she was there to protect me from the bureaucracy.

I would like to thank my parents and sisters for putting up with me for all these years. Without you I wouldn’t be here today. Finally, I thank Nathalie for her patience and understanding. While this process has been trying for me, it has probably been more trying for her. I wouldn’t be writing this today without her patience and for that I am eternally grateful.

Allen Grogan Clement
The University of Texas at Austin
December 2010

Abstract

UpRight Fault Tolerance

Publication No.

Allen Grogan Clement, Ph.D.
The University of Texas at Austin, 2010

Co-Supervisor: Lorenzo Alvisi
Co-Supervisor: Mike Dahlin

Experiences with computer systems indicate an inconvenient truth: computers fail and they fail in interesting ways. Although using redundancy to protect against fail-stop failures is common practice, non-fail-stop computer and network failures occur for a variety of reasons including power outage, disk or memory corruption, NIC malfunction, user error, operating system and application bugs or misconfiguration, and many others. The impact of these failures can be dramatic, ranging from service unavailability to stranding airplane passengers on the runway to companies closing. While high-stakes embedded systems have embraced Byzantine fault tolerant techniques, general purpose computing continues to rely on techniques that are fundamentally crash tolerant.
In a general purpose environment, the current best practices response to non-fail-stop failures can charitably be described as pragmatic: identify a root cause and add checksums to prevent that error from happening again in the future. Pragmatic responses have proven effective for patching holes and protecting against faults once they have occurred; unfortunately the initial damage has already been done, and it is difficult to say if the patches made to address previous faults will protect against future failures.

We posit that an end-to-end solution based on Byzantine fault tolerant (BFT) state machine replication is an efficient and deployable alternative to current ad hoc approaches favored in general purpose computing. The replicated state machine approach ensures that multiple copies of the same deterministic application execute requests in the same order and provides end-to-end assurance that independent transient failures will not lead to unavailability or incorrect responses. An efficient and effective end-to-end solution covers faults that have already been observed as well as failures that have not yet occurred, and it provides structural confidence that developers won’t have to track down yet another failure caused by some unpredicted memory, disk, or network behavior.

While the promise of end-to-end failure protection is intriguing, significant technical and practical challenges currently prevent adoption in general purpose computing environments. On the technical side, it is important that end-to-end solutions maintain the performance characteristics of deployed systems: if end-to-end solutions dramatically increase computing requirements, dramatically reduce throughput, or dramatically increase latency during normal operation then end-to-end techniques are a non-starter. On the practical side, it is important that end-to-end approaches be both comprehensible and easy to incorporate: if the cost of end-to-end solutions is rewriting an application or trusting intricate and arcane protocols, then end-to-end solutions will not be adopted.

In this thesis we show that BFT state machine replication can and should be used in deployed systems. Reaching this goal requires us to address both the technical and practical challenges previously mentioned. We revisit disparate research results from the last decade and tweak, refine, and revise the core ideas to fit together into a coherent whole. Addressing the practical concerns requires us to simplify the process of incorporating BFT techniques into legacy applications.

Contents

Acknowledgments
Abstract
List of Tables
List of Figures

Chapter 1: Introduction

Chapter 2: Failure models and fault tolerance
2.1 Classifying node and network behaviors
2.1.1 Faulty behaviors
2.1.2 Correct behaviors
2.1.3 Cryptographic assumptions and notation
2.1.4 Network Behaviors
2.2 Fault tolerance
2.3 Why UpRight?

Chapter 3: Robust Performance
3.1 Introduction
3.2 Recasting the problem
3.3 Aardvark: RBFT in action
3.4 Protocol description
3.4.1 Client request transmission
3.4.2 Replica agreement
3.4.3 Primary view changes
3.4.4 Implementation
3.5 Analysis
3.6 Experimental evaluation
3.6.1 Common case performance
3.6.2 Evaluating faulty systems
3.7 Conclusion

Chapter 4: UpRight RSM Architecture
4.1 UpRight architecture
4.2 Division of responsibilities
4.2.1 Library properties
4.2.2 Application requirements
4.3 Looking forward

Chapter 5: UpRight Stages
5.1 Basic stage interactions
5.1.1 Client properties
5.1.2 Authentication properties
5.1.3 Order properties
5.1.4 Execution properties
5.1.5 Putting the stages together
5.2 Network efficiency
5.3 Garbage collection and transient crashes
5.3.1 Order stage
5.3.2 Execution stage
5.3.3 Authentication stage
5.3.4 Client
5.4 Full property list
5.4.1 Client Properties
5.4.2 Authentication stage properties
5.4.3 Order stage properties
5.4.4 Execution stage properties
5.5 Supported optimizations
5.6 Messages and notation
5.7 Stage level pseudo-code
5.7.1 Client operation
5.7.2 Authentication operation
5.7.3 Order operation
5.7.4 Execution operation
5.8 Conclusion

Chapter 6: UpRight Replication
6.1 Consensus background
6.2 Replicated order stage
6.2.1 Normal-operation—Zyzzyvark
6.2.2 Checkpoint-operation
6.2.3 Interactions with other stages
6.2.4 Order stage properties
6.3 Replicated execution stage
6.3.1 Execution consensus
6.3.2 Execution-stage checkpoints
6.3.3 Interactions with other stages
6.3.4 Execution stage properties
6.4 Replicating authentication stage
6.4.1 Authentication consensus
6.4.2 Interactions with other stages
6.4.3 Authentication stage properties
6.5 Implementation and performance
6.6 Discussion
6.7 Conclusion

Chapter 7: UpRight Applications
7.1 Request Processing
7.2 Checkpoint Generation
7.3 HDFS case study
7.3.1 Baseline system
7.3.2 UpRight-HDFS
7.3.3 Evaluation
7.3.4 MapReduce
7.4 ZooKeeper case study
7.4.1 Baseline system
7.4.2 UpRight-ZooKeeper
7.4.3 Evaluation
7.5 Conclusion and Discussion

Chapter 8: Background and state machine replication
8.1 RSM approach
8.2 Consensus
8.3 Recent RSM history
8.4 Performance with failures
8.5 Application fault tolerance

Chapter 9: Conclusion

Appendix A: UpRight Library Byte Specifications
A.1 Basic Message Structure
A.2 Inter-stage messages
A.2.1 Message Tags
A.2.2 Inter-stage messages
A.2.3 Order stage checkpoint
A.3 Execution node specifications
A.3.1 Message Tags
A.3.2 Execution checkpoints
A.3.3 Execution Messages

Appendix B: UpRight Library API
B.1 Client API
B.2 Server API
Bibliography

Vita

List of Tables

2.1 (a) Acceptors required to solve asynchronous consensus under various failure models. c is the maximum number of crash failures and b is the maximum number of Byzantine failures tolerated while ensuring the system is both safe and live. u is the maximum number of failures tolerated while ensuring the system is up. r is the maximum number of commission failures tolerated while ensuring the system is right. (b) Acceptors required to solve asynchronous consensus under the crash (Byzantine) failure model for various values of f = b = c. (c) Acceptors required to solve asynchronous consensus under a hybrid failure model with varying values of b and c. (d) Acceptors required to solve asynchronous consensus under the UpRight model with varying values of u and r. Values representing equivalent configurations across tables are marked with emphasis (italicized for BFT configurations, bolded for CFT configurations, or underlined for HFT configurations).
3.1 Observed peak throughput of BFT systems in a fault-free case and when a single faulty client submits a carefully crafted series of requests. We detail our measurements in Section 3.6.2. † The result reported for Q/U is for correct clients issuing conflicting requests. ‡ The HQ prototype demonstrates fault-free performance and does not implement many of the error-handling steps required to resolve inconsistent MACs.
3.2 Peak throughput of Aardvark and PBFT for different implementation choices.
3.3 Observed peak throughput of BFT systems in the fault-free case and under heavy client retransmission load. UDP network flooding corresponds to a single faulty client sending 9KB messages. TCP network flooding corresponds to a single faulty client sending requests to open TCP connections and is shown for TCP-based systems.
3.4 Throughput during intervals in which the primary delays sending pre-prepare messages (or equivalent) by 1, 10, and 100 ms.
3.5 Average throughput for a starved client that is shunned by a faulty primary versus the average per-client throughput for any other client.
3.6 Observed peak throughput and observed throughput when one replica floods the network with messages. UDP flooding consists of a replica sending 9KB messages to other replicas rather than following the protocol. TCP flooding consists of a replica repeatedly attempting to open TCP connections on other replicas.
5.1 Message specification for messages exchanged between stages. The sender and recipients of the messages are indicated.
5.2 Summary of symbols used and their meanings.
6.1 Summary of stage-level replication requirements.
6.2 Consensus semantics for messages related to the order stage. Each proposal or learn message is part of a single consensus instance. The utility messages are used by both consensus protocols.
6.3 State management messages exchanged between execution replicas.
6.4 Summary of replication requirements for different checkpoint storage strategies.
6.5 Inter-stage messages and their role in the execution consensus protocol.
6.6 Inter-stage messages related to stage management.
6.7 Messages sent to and from the authentication stage.
7.1 Informal statement of application requirements.
A.1 Message tags for all intra-node messages.
A.2 Set of messages for intra-node communication.

List of Figures

2.1 Different classifications of failure types. (a) represents crash failures. (b) represents omission failures, a superset of crash failures. (c) represents Byzantine, or arbitrary, failures, which encompass all behaviors. (d) represents commission failures, the set of Byzantine behaviors that cannot be classified as omission failures.
3.1 Physical network in Aardvark.
3.2 Architecture of a single replica. The replica utilizes a separate NIC for communicating with each other replica and a final NIC to communicate with the collection of clients. Messages from each NIC are placed on separate worker queues.
3.3 Basic communication pattern in Aardvark.
3.4 Decision tree followed by replicas while verifying a client request. The narrowing width of the edges indicates the relative volume of client requests that survive each step of the verification process.
3.5 Decision tree followed by a replica when handling messages received from another replica. The width of the edges indicates the rate at which messages reach various stages in the processing.
3.6 Average per-request latency vs. average throughput for Aardvark, HQ, PBFT, Q/U, and Zyzzyva.
3.7 The latency of an individual client’s requests running Aardvark with 210 total clients. The sporadic jumps represent view changes in the protocol.
3.8 CDF of request latencies for 210 clients issuing 100,000 requests with Aardvark servers.
4.1 Basic flow of messages in the UpRight architecture.
5.1 Message flow between idealized stages in the UpRight architecture.
5.2 Messages exchanged between stages. (1) Clients send requests to the authentication stage. (2) The authentication stage sends validated request hashes to the order stage. (3) The order stage sends ordered batches to the execution stage. (4a, 4b) The execution stage fetches request bodies from the authentication stage. (4c) The execution stage sends responses to the clients. Note that the messages travel through the system in a clockwise fashion.
5.3 Interactions between persistent state at each stage. The state maintained by the other stages depends on the state maintained at the order stage. The order stage maintains one or two checkpoints and between CP interval and 2 × CP interval − 1 ordered batches. The authentication stage maintains every request referenced by an ordered batch stored at the order stage and at most one pending request per client. The execution stage maintains two checkpoints that correspond to order stage checkpoints. Additional details on the contents of the order and execution checkpoints can be found in Figure 5.4 and Figure 5.5 respectively.
5.4 Order stage checkpoint.
5.5 Execution stage checkpoint.
5.6 Pseudo-code for the client.
5.7 Pseudo-code for the authentication stage to follow.
5.8 Pseudo-code for the order stage to follow.
5.9 Pseudo-code for the execution node to follow.
6.1 Basic communication pattern for complete agreement.
6.2 Basic communication pattern for tentative agreement.
6.3 Basic communication pattern for speculative agreement.
6.4 Basic communication pattern for the order stage checkpoint consensus protocol. Note that while the execution stage acts as a single proposer, each individual replica is a distinct learner. In the context of the UpRight library, learning is done only when a network or node failure occurs.
6.5 Execution consensus.
6.6 Execution replica pseudo-code related to intra-stage checkpoint and state transfer.
6.7 Authentication consensus.
6.8 Latency v. throughput for J-Zyzzyvark and JSZyzzyvark.
6.9 Latency v. throughput for JSZyzzyvark configured for various values of r and u.
6.10 Latency v. throughput for JSZyzzyvark configured for various values of r and u with authentication, order, and execution replicas co-located.
6.11 Jiffies per request. RQ indicates the jiffies at the authentication stage; Order indicates the jiffies at the order stage; Execution indicates the jiffies at the execution stage.
6.12 JSZyzzyvark performance when using the authentication replica and matrix signatures, standard signatures, and MAC authenticators (1B requests).
6.13 JSZyzzyvark performance for 1B, 1KB, and 10KB requests, and for 1KB and 10KB requests where full requests, rather than digests, are routed through order replicas.
7.1 UpRight application architecture from an application developer perspective. The UpRight library is a black box with a well defined interface. At both the client and the server, the developer implements application-specific glue that connects the library shim to the original application.
7.2 The checkpoint/delta approach for managing application checkpoints. Original application checkpoints are taken infrequently, but the library requests a checkpoint every 100 batches. (a) shows the original application checkpoint taken after executing batch n. (b) shows the checkpoint returned to the replication library after executing batch n + 100. This checkpoint consists of the application checkpoint at n and the log of the next 100 batches. (c) shows the checkpoint returned to the replication library after executing batch n + 200. (d) shows the checkpoint returned to the replication library after executing batch n + 400.
7.3 Checkpoint-deltas returned to the application. Each returned checkpoint-delta consists of a coarse grained application checkpoint and sufficient deltas to produce the next coarse grained checkpoint.
7.4 Throughput for HDFS and UpRight-HDFS.
7.5 CPU consumption (jiffies per GB of data read or written) for HDFS and UpRight-HDFS.
7.6 Completion time for requests issued by a single client. In (a), the HDFS NameNode fails and is unable to recover. In (b), a single UpRight-HDFS NameNode fails, and the system continues correctly.
7.7 Execution time for TeraGen and TeraSort MapReduce workloads.
7.8 Throughput for UpRight-ZooKeeper and ZooKeeper for workloads comprising different mixes of 1KB reads and writes.
7.9 Per-request CPU consumption for UpRight-ZooKeeper and ZooKeeper for a write-only workload. The y axis is in jiffies. In our system, one jiffy is 4 ms of CPU consumption.
7.10 Performance v. time as machines crash and recover for ZooKeeper and UpRight-ZooKeeper.
A.1 Messages are built upon a verified message base. This basic byte structure contains 4 fields: tag, payload size, payload, and authentication.
A.2 Basic byte structure of a message with simple MAC authentication.
A.3 Byte definition for a message authenticated with a MAC array. The sender is the replica responsible for generating the MACs; the Digest field is a digest of the tag, payload size, and sender fields. The MACs are generated using the byte representation of the digest rather than the full message.
A.4 Message authenticated with a matrix signature. The authentication block of these messages consists of a collection of MAC arrays that each authenticate the tag, size, and payload.
A.5 Byte specification of the Entry at the core of every request.
A.6 Byte specification of the payload of an ⟨auth-req, ⟨req-core, c, n_c, hash(op)⟩μ⃗_f,O, f⟩μ_f,o message.
A.7 Byte specification of the payload of a ⟨command, n_o, c, n_c, op, f⟩μ_f,e message.
A.8 Byte specification of a ⟨next-batch, v, n_o, H, B, t, bool, o⟩μ⃗_o,E message.
A.9 Byte encoding of non-determinism. The two fields correspond to time and a seed for random number generation.
A.10 Byte specification of the ⟨reply, n_c, R, H, e⟩μ_e,c message.
A.11 Byte specification of the payload for a ⟨request-cp, n_o, o⟩μ⃗_o,E message.
A.12 Byte specification of the payload for a ⟨release-cp, T_cp, n_o, o⟩μ⃗_o,E message.
A.13 Byte specification of the payload for a ⟨retransmit, c, o⟩μ⃗_o,E message.
A.14 Byte specification of the payload for a ⟨load-cp, T_cp, n_o, o⟩μ_o,e message.
A.15 Byte specification of a ⟨batch-complete, v, n_o, C, e⟩μ⃗_e,F message.
A.16 Byte specification of a ⟨fetch, n_o, c, n_c, hash(op), e⟩μ⃗_e,F message.
A.17 Byte specification of a ⟨cp-up, n_o, C, e⟩μ⃗_e,F message.
A.18 Byte specification of ⟨last-exec, n_e, e⟩μ⃗_e,O and ⟨cp-loaded, n_o, e⟩μ⃗_e,O messages.
A.19 Byte specification for the payload of a ⟨cp-token, n_o, T_cp, e⟩μ⃗_e,O message.
A.20 Order node checkpoint.
A.21 Order node checkpoint byte specification.
A.22 Exec node checkpoint.
A.23 Exec node checkpoint byte specification.
A.24 Byte specification of the payload of a ⟨fetch-exec-cp, n, e⟩μ⃗_e,E message.
A.25 Byte specification of the payload of an ⟨exec-cp-state, n, S, e⟩μ_e,e′ message.
A.26 Byte specification of the payload of a ⟨fetch-state, T_state, e⟩μ⃗_e,E message.
A.27 Byte specification of the payload of a ⟨state, T_state, S, e⟩μ_e,e′ message.
B.1 Interface exported by the UpRight library to the application client.
B.2 Interface implemented by the application client.
B.3 Interface implemented by the application server and called by the UpRight library. The six functions can be considered as three pairs of common functionality: (a) request execution, (b) checkpoint management, and (c) state transfer.
B.4 Interface exported by the UpRight library to the application server as call-backs. The functions can be considered in groups based on common functionality: (a) response processing, (b) checkpoint management, (c) state transfer, and (d) generic management.

Chapter 1: Introduction

Experiences with computer systems indicate an inconvenient truth: computers fail and they fail in interesting ways. Although using redundancy to protect against fail-stop failures is common practice [12, 19, 39, 44, 89, 108], non-fail-stop computer and network failures occur for a variety of reasons including power outage [51], disk or memory corruption [8, 90, 91], NIC malfunction [2, 21, 96], user error [41, 78], operating system and application bugs [82, 105, 106] or misconfiguration [71, 102], and many others. The impact of these failures can be dramatic, ranging from service unavailability [97] to stranding airplane passengers on the runway [21] to companies closing [14]. While high-stakes embedded systems have adopted Byzantine fault tolerant techniques (e.g., avionics [10, 31, 46]), general purpose computing continues to rely on techniques that are fundamentally crash tolerant.

In a general purpose environment, the current best practices response to non-fail-stop failures can charitably be described as pragmatic: identify a root cause and add checksums to detect the error and prevent it from causing more problems in the future. Pragmatic responses have proven effective for patching holes and protecting against faults once they have occurred; unfortunately the initial damage has already been done, and it is difficult to say if the patches made to address previous faults will protect against future failures.

We posit that an end-to-end solution based on Byzantine fault tolerant (BFT) state machine replication is an efficient and deployable alternative to current ad hoc approaches favored in general purpose computing.
The replicated state machine approach ensures that multiple copies of the same deterministic application execute requests in the same order and provides end-to-end assurance that independent transient failures will not lead to unavailability or incorrect responses. An efficient and effective end-to-end solution covers faults that have already been observed as well as failures that have not yet occurred, and it provides structural confidence that developers won’t have to track down yet another failure caused by some unpredicted memory, disk, network, or other behavior.

While the promise of end-to-end failure protection is intriguing, significant technical and practical challenges currently prevent adoption in general purpose computing environments. On the technical side, it is important that end-to-end solutions maintain the performance characteristics of deployed systems: if end-to-end solutions dramatically increase computing requirements, dramatically reduce throughput, or dramatically increase latency during normal operation, then end-to-end techniques are not appealing. On the practical side, it is important that end-to-end approaches be both comprehensible and easy to incorporate: if the cost of end-to-end solutions is rewriting an application or trusting intricate and arcane protocols, then end-to-end solutions will not be widely adopted.

The goal of this thesis is to make deploying Byzantine fault tolerant systems in a general purpose computing environment easier. To that end, the contributions of this thesis fall into three broad categories. First, we re-define what it means for a system to be fault tolerant. Second, we re-architect a (Byzantine) fault-tolerant library. Third, we re-engineer legacy applications to be Byzantine fault tolerant.

• Re-defining the problem. We restate what it means for systems to be fault tolerant in two fundamental ways. First, we embrace the UpRight model for counting failures and designing systems. Second, we advocate the design of fault tolerant systems that are robust to failures, i.e., systems that provide solid performance even when failures occur.

Chapter 2 presents the UpRight failure model, an alternative formulation to the traditional Byzantine and crash failure models. The UpRight model has three distinct advantages over traditional failure models. First, it can express the traditional crash [11], Byzantine [79], and hybrid [98] fault models. Second, systems designed under the UpRight failure model provide the specified fault tolerance at the minimal replication cost. Third, designing systems under the UpRight failure model makes the question of providing “Byzantine or crash fault tolerance” a deployment rather than an implementation question; the implementation question becomes whether it is appropriate to provide “fault tolerance or no fault tolerance?”

Chapter 3 presents the case for robust fault tolerance and demonstrates that robust (Byzantine) fault tolerant systems are feasible. Fault tolerant systems have traditionally been evaluated based on the throughput provided during failure-free executions, ignoring their performance in the presence of failures. A side effect of this evaluation focus has been protocol designs and prototype implementations that can be rendered unusable by a single faulty client or server. We argue that fault tolerant systems should be expected to perform well during failure-ful executions and demonstrate that robust fault tolerant implementations are possible.

• Re-architecting BFT.
We revisit the design of BFT systems in order to correct, combine, and refine a multitude of ideas that have been developed in the last decade. This portion of the thesis focuses on the design and implementation of the UpRight library. The contribution from this portion of the thesis rests with (a) the specification of responsibilities for the library and the application, (b) the stage-wise description of the steps required for state machine replication, and (c) the use of consensus to fully describe the interactions between nodes in the system.

Chapter 4 lays the foundation for the subsequent chapters. Chapter 4 establishes (1) the basic interaction between the UpRight library and a replicated application and (2) a new architecture for state machine replication. The UpRight library delivers a linearized sequence of batches of requests—rather than the individual requests delivered by previous systems—to the application for deterministic execution. This subtle shift in the objects delivered to the application provides the application with additional freedom with respect to processing requests. The UpRight architecture divides state machine replication around three core functions—request authentication, request ordering, and request execution—rather than the traditional two (request ordering and request execution) [88]. We present a replication architecture based on separating authentication, order, and execution into three distinct stages and an abstract protocol for coordinating those stages.

Chapter 5 details the interactions between the stages identified in Chapter 4. Our work at the stage level is geared towards providing an end-to-end protocol for correct stages to follow that fulfills the library requirements described in the previous chapter. We ensure that the stage-level protocol provides the appropriate end-to-end properties with correct stages despite faulty clients, an unreliable network with finite bandwidth, finite storage, and transient crashes (e.g., due to temporary power outages).

While Chapter 5 focuses on the interaction between stages, Chapter 6 discusses the replicated implementation of each stage of the UpRight architecture. We base each stage on consensus. Even though our designs for the authentication, order, and execution stages are each based on consensus, the protocols implementing each stage require different amounts of replication and different coordination between the replicas.

• Re-engineering deployed applications. We demonstrate that BFT replication techniques can be incorporated into existing applications with modest effort and without decimating performance. This work requires us to design the interface between replication libraries and applications to be minimally invasive to the application and also to test that design by integrating the library into deployed applications.

Chapter 7 describes the interface between the UpRight library and applications and relates our experience incorporating the UpRight library into the Hadoop distributed file system (HDFS) and the ZooKeeper distributed coordination service. We take a pragmatic view of the interactions between the library and applications. For example, we prioritize using existing mechanisms, e.g., for checkpoint generation, over highly optimized and generic functionality in the library that may require extensive modification to the application to be useful. We find that we can provide UpRight versions of HDFS and ZooKeeper that offer competitive performance at only nominal development effort.
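To make the division of labor concrete, the following sketch shows the general shape of the server-side glue described above. It is purely illustrative: the interface and method names (ReplicatedServer, executeBatch, takeCheckpoint, loadCheckpoint) are hypothetical stand-ins and are not the UpRight library's actual Java API, which is specified in Appendix B.

    // Illustrative only: a hypothetical shape for application-side "glue" that a
    // replication library could call into. The real UpRight interfaces appear in
    // Appendix B; the names and signatures below are invented for exposition.
    import java.util.List;

    interface ReplicatedServer {
        // Deterministically execute an ordered batch of opaque requests and
        // return one reply per request, in order.
        List<byte[]> executeBatch(long batchId, List<byte[]> requests);

        // Produce an opaque checkpoint of application state, reusing whatever
        // snapshot mechanism the application already has.
        byte[] takeCheckpoint(long batchId);

        // Restore application state from a previously produced checkpoint,
        // e.g., when a replica recovers or has fallen behind.
        void loadCheckpoint(byte[] checkpoint);
    }

The essential properties, as Chapter 4 explains, are that execution is deterministic and operates on whole batches, and that checkpoints can be produced and reloaded on demand.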
Appendix A describes the byte specification for all messages exchanged and persistent state in our prototype of the UpRight library. Appendix B details the Java interfaces exported to the application client and server by the UpRight library.

Chapter 2: Failure models and fault tolerance

We want distributed systems to be up (live) and right (safe). Intuitively, a system is up if it processes every received request and right if processed requests are processed correctly. The variety of ways in which things can go wrong, however, makes building systems that are up and right challenging: networks can fail—by delaying, corrupting, or dropping messages—and nodes can misbehave—by failing to take a specified action or taking an arbitrary unspecified action. A fault tolerant system is designed to be up and right despite node and network failures.

In this thesis, we target distributed systems that are UpRight; an UpRight system is up (i.e., live) despite up to u Byzantine failures and right (i.e., safe) despite up to r commission failures. This definition of UpRight fault tolerance is layered with jargon and technical terms. To understand the practical implications of UpRight fault tolerance we must first understand the terminology and taxonomy of how computers and the network behave (Section 2.1) and the relationship between UpRight fault tolerance and traditional notions of crash and Byzantine fault tolerance (Section 2.2). We conclude this chapter with a brief discussion of the practical benefits of building systems to provide UpRight fault tolerance (Section 2.3) as opposed to traditional crash [87], Byzantine [61], or hybrid fault tolerance [98].

2.1 Classifying node and network behaviors

There is a clear dichotomy between nodes that are correct and nodes that are faulty. Correct nodes always follow a protocol specification faithfully while faulty nodes deviate from the specification in some way. Failures can take different forms, and fault tolerant protocols must be designed under some failure model that defines the failures the protocol is designed to tolerate. The rest of this section explores the definition of such failure models.

2.1.1 Faulty behaviors

The simplest type of failure is a crash failure. A replica exhibits a crash failure [87] if it permanently halts. Note that a node that “crashes” and is subsequently rebooted does not exhibit a crash failure because the “crash” is not permanent. A replica that fails to send or receive a subset of messages exhibits a general omission failure [80]. A replica that arbitrarily deviates from its specification exhibits a Byzantine failure [61]. These failure types form a simple hierarchy: every crash failure is an omission failure and every omission failure is a Byzantine failure.

The traditional failure hierarchy provides a well-defined classification for every type of failure, but it does not provide a convenient label for an important and interesting subset of failures: Byzantine failures that are not omission failures. These failures are called commission failures [72]. Intuitively, a node exhibits a commission failure when it deviates from its specification by taking an unnecessary or incorrect action. This is in contrast with omission failures, which are marked by the failure to take an action. We present a graphical depiction of the relationship between crash, omission, Byzantine, and commission failures in Figures 2.1(a)-(d). Differentiating between omission and commission failures allows us to identify precisely the behaviors that make tolerating Byzantine failures more expensive than tolerating omission failures.

Figure 2.1: Different classifications of failure types. (a) represents crash failures. (b) represents omission failures, a superset of crash failures. (c) represents Byzantine, or arbitrary, failures, which encompass all behaviors. (d) represents commission failures, the set of Byzantine behaviors that cannot be classified as omission failures.

2.1.2 Correct behaviors

Correct nodes follow their specification faithfully. Many fault tolerant systems rely on a threshold of correct nodes to ensure correct operation. Fulfilling this expectation can be difficult given the practical reality that a power outage can cause every machine in a data center to temporarily crash before power is restored. In theory, machines that exhibit transient crash behavior can be treated as “correct yet slow” and do not impact the safety guarantees provided by the system. In practice, ensuring that nodes remain “correct yet slow” despite transient crashes requires individual nodes to be engineered (a) to commit state to persistent memory before outputting messages onto the network and (b) to restore working state from persistent memory following a transient crash.

Note that a node that is not engineered to tolerate transient crashes may be technically guilty of a commission failure if it loses important state while recovering from a transient crash. Consider, for example, a banking service that loses all records of the last ten transactions, including a deposit of $10,000 into a client’s account, when it crashes due to a power outage. When power is restored and the service resumes operation it will have no record of the deposit and will incorrectly report the balance to be smaller than it should be. In this case the service is guilty of a commission failure—the client believes the transaction occurred but the service does not.
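The two engineering requirements above (persist state before sending, restore state on restart) follow the standard write-ahead pattern. The sketch below is a minimal illustration of that pattern only; the file layout and the class name PersistentCounter are hypothetical and are not how the UpRight library or the banking example is actually implemented.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // Minimal write-ahead sketch: (a) force each update to persistent storage
    // before any reply is sent, (b) replay the log to rebuild working state after
    // a transient crash. Hypothetical; for illustration of Section 2.1.2 only.
    final class PersistentCounter {
        private final FileChannel log;
        private long balance;                 // in-memory working state

        PersistentCounter(Path logFile) throws IOException {
            log = FileChannel.open(logFile, StandardOpenOption.CREATE,
                    StandardOpenOption.READ, StandardOpenOption.WRITE);
            recover();                        // (b) restore state from the log
        }

        long deposit(long amount) throws IOException {
            ByteBuffer rec = ByteBuffer.allocate(Long.BYTES).putLong(0, amount);
            log.write(rec, log.size());
            log.force(true);                  // (a) durable before we answer
            balance += amount;
            return balance;                   // only now may a reply be sent
        }

        private void recover() throws IOException {
            ByteBuffer rec = ByteBuffer.allocate(Long.BYTES);
            for (long pos = 0; pos + Long.BYTES <= log.size(); pos += Long.BYTES) {
                rec.clear();
                log.read(rec, pos);
                balance += rec.getLong(0);
            }
        }
    }

A node built this way can lose its volatile state in a power outage and still come back "correct yet slow"; the subtlety noted later in this chapter is knowing when the force to disk has really completed.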
2.1.3 Cryptographic assumptions and notation

We assume that cryptographic techniques like collision-resistant hashing, message authentication codes (MACs), encryption, and signatures are secure. In particular, no node d can (a) create hash collisions or (b) forge the signature or MAC of a correct node c ≠ d. Note that any node can forge the signature or MAC of another node that has shared its authentication credentials; sharing authentication credentials constitutes a commission failure.

We denote a message X signed by principal p’s public key as ⟨X⟩σ_p. We denote a message X with a MAC appropriate for principals p and r as ⟨X⟩μ_p,r; by standard convention, the order of the nodes indicates that p is the sender and r is the recipient. We denote a message containing a MAC authenticator—an array of MACs appropriate for verification by multiple nodes—as ⟨X⟩μ⃗_p or ⟨X⟩μ⃗_p,R. The former notation denotes a message authenticated by principal p for verification by every node; the latter denotes a message authenticated by principal p for verification by the nodes in the set R.
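As an illustration of the MAC-authenticator notation above, the sketch below computes one HMAC per intended verifier over the same message, using a distinct pairwise key for each sender/receiver pair. It is a minimal example of the concept only: key management and the byte layout actually used by the UpRight prototype are specified in Appendix A, and the class and method names here are hypothetical.

    import java.util.LinkedHashMap;
    import java.util.Map;
    import javax.crypto.Mac;
    import javax.crypto.spec.SecretKeySpec;

    // Sketch of a MAC authenticator: an array (here, a map) of MACs over the same
    // message, one per verifier, each computed with the pairwise key shared by the
    // sender and that verifier. Hypothetical helper; not the UpRight library API.
    final class MacAuthenticator {
        // verifier id -> MAC over the message under the sender/verifier shared key
        static Map<String, byte[]> generate(byte[] message,
                                            Map<String, byte[]> pairwiseKeys) throws Exception {
            Map<String, byte[]> macs = new LinkedHashMap<>();
            for (Map.Entry<String, byte[]> e : pairwiseKeys.entrySet()) {
                Mac hmac = Mac.getInstance("HmacSHA256");
                hmac.init(new SecretKeySpec(e.getValue(), "HmacSHA256"));
                macs.put(e.getKey(), hmac.doFinal(message));
            }
            return macs;
        }

        // A verifier recomputes its own entry and compares; it cannot check the others.
        static boolean verify(byte[] message, byte[] claimedMac, byte[] sharedKey) throws Exception {
            Mac hmac = Mac.getInstance("HmacSHA256");
            hmac.init(new SecretKeySpec(sharedKey, "HmacSHA256"));
            return java.security.MessageDigest.isEqual(hmac.doFinal(message), claimedMac);
        }
    }

Unlike a signature, each recipient can verify only the entry computed under its own key, which is why a faulty sender can produce an authenticator that verifies at some replicas but not at others (the inconsistent MACs mentioned in Table 3.1).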
An unreliable network may arbitrarily reorder, lose, duplicate, or corrupt messages. We define a synchronous interval [18, 33, 53], to be a period in which the network reliably delivers messages with a bounded delay. Definition 1 (Synchronous interval). During a synchronous interval any message sent between correct nodes is delivered within a bounded delay T if the sender retransmits according to some schedule until the message is delivered. We assume that synchronous intervals of arbitrary length occur infinitely often. This assumption is known as eventual synchrony [33]. 2.2 Fault tolerance Fault tolerant systems are designed to be safe and live despite failures. Intuitively, a system is live (aka up) if it provides a response to client requests and is safe (aka right) if all provided responses are correct. The number of nodes required to implement a fault tolerant system depends on the number and types of failures to be tolerated in addition to the targeted safety and liveness properties. The primary focus of this section is exploring four different ways to formulate the number and type of failures that the system tolerates—crash fault tolerance, Byzantine fault tolerance, hybrid fault tolerance, and UpRight fault tolerance. To make the discussion more concrete, we describe the replication requirements for asynchronous consensus [79] protocols under each fault tolerance formulation. A consensus protocol is at the core of every replicated state machine. Consensus. We focus our discussion on Lamport’s formulation of Paxos-style con- sensus [53, 54, 56], which is based on the assignment of each node in the system to at least one of three roles: proposers, acceptors, and learners. Proposers propose values to the system, acceptors coordinate in some way to choose a single proposed 9 value, and learners learn values that have been chosen. In this context, consensus is defined by three safety properties [56]: • Only a value proposed by a proposer can be chosen. • Only a single value is chosen. • Non-faulty learners only learn chosen values. and a single liveness property: • Given a sufficiently long synchronous interval, if a non-faulty proposer proposes a value, then non-faulty learners eventually learn a value. A fault tolerant consensus protocols is safe and live for any number of faulty proposers and learners and a bounded number of faulty acceptors1 Crash fault tolerance. Crash fault tolerant (CFT) protocols are guaranteed to be safe and live despite up to c crash failures. In practice, omission failures and a lossy network are indistinguishable from any participant in the system, so asynchronous CFT systems are effectively safe and live despite up to c omission failures [28]. In general, a total of at least 2c + 1 acceptors are required to implement a CFT consensus protocol that is safe and live despite c crash/omission failures[53, 56]2 . Byzantine fault tolerance. Researchers have developed a multitude of Byzan- tine fault tolerant (BFT) protocols designed to be safe and live despite up to b Byzantine failures [1, 18, 24, 26, 49, 50, 92, 100, 104, 107]. In general, a total of at least 3b + 1 acceptors are required to implement a BFT consensus protocol that is safe and live despite b Byzantine failures[56, 79]3 . Because every omission failure is also a Byzantine failure, Byzantine fault tolerant systems provide protection against a wider variety of failures than crash fault tolerant systems. 
This makes BFT techniques very powerful and flexible, benefits that can come at a significant cost. Consider, for example, that a system running CFT consensus configured to tolerate up to c = 4 crash failures requires 9 = 2 × 4 + 1 acceptors. If it is subsequently discovered that it is important to tolerate one additional failure, a commission failure, the system has to be transitioned to use BFT techniques requiring 16 = 3 × (4 + 1) + 1 acceptors.

Hybrid fault tolerance. Hybrid fault tolerance (HFT) [98] is a response to the trepidation over the high cost of transitioning from CFT to BFT techniques and the observation that failures that require BFT are relatively rare. HFT protocols are designed to be safe and live despite up to b Byzantine and c crash failures. A total of at least 3b + 2c + 1 acceptors are required to implement an HFT consensus protocol [98]. Returning to the example above where the system needs to be safe and live despite c = 4 crash failures and b = 1 Byzantine (commission) failure, HFT consensus can be implemented using 12 = 3 × 1 + 2 × 4 + 1 acceptors.

UpRight fault tolerance. In this thesis, we advocate UpRight fault tolerance [23], initially described by Lamport [55] and subsequently employed by others [1, 32]. UpRight fault tolerance is motivated by the reality that systems should be up (i.e., live) and right (i.e., safe) despite failures and the recognition that the replication requirements for these two concerns are separate. Under UpRight fault tolerance, systems are designed to be live despite up to u failures of any type and safe despite up to r commission failures. Intuitively, UpRight systems provide the following guarantees: (1) as long as there are at most u failures, the system is guaranteed to respond and (2) as long as there are at most r commission failures, any response is guaranteed to be correct. We note that when u < r, UpRight systems do not guarantee a response when there are between u + 1 and r commission failures, inclusive, but do guarantee that any received response will be correct.

The formulation of UpRight fault tolerance can be initially difficult to internalize. As a simple primer, consider a pair of hypothetical distributed systems, one where u = 3 and r = 1 and a second where u = 1 and r = 3. The first system is appropriate for environments where (a) crashes are much more common than commission failures or (b) a higher premium is placed on liveness than on safety. This configuration is guaranteed to provide a response to any request as long as at most three servers are faulty. Further, any response is guaranteed to be correct as long as at most one server is guilty of a commission failure. If two servers have been hacked (an extreme version of a commission failure) and all other servers are correct, then a user is guaranteed to receive a response (u = 3 > 2), but that response is not guaranteed to be correct (r = 1 < 2).
If there are four failures then a user is not guaranteed to receive a response; further, if at least two of the failures are commission failures then the user is not guaranteed that any received response can be trusted.

The second system is appropriate for environments where (a) commission failures are more common than omission failures and/or (b) a higher premium is placed on safety than on liveness. This configuration is guaranteed to provide a response as long as at most one server fails in any way. If multiple servers fail, then a user is not guaranteed a response. However, if at most three servers are guilty of commission failures any response is guaranteed to be correct.

In general, a total of 2u + r + 1 acceptors are required to implement an UpRight consensus protocol that is live despite up to u Byzantine failures and safe despite up to r commission failures [32, 56]. (As Lamport notes, there are very specific configurations of proposers, acceptors, and learners that require fewer acceptors. A total of 2u + r + 1 acceptors is always sufficient to implement UpRight consensus.) We now revisit the previous example intended to tolerate four crash failures and one commission failure. In UpRight parlance, the system is expected to be up despite up to u = 4 omission failures and right despite up to r = 1 commission failures. An UpRight fault tolerant consensus protocol can be implemented using 10 = 2 × 4 + 1 × 1 + 1 acceptors.

Comparing replication requirements. Table 2.1(a) summarizes the formulas for the minimum number of acceptors required to implement CFT, BFT, HFT, and UpRight consensus protocols. Table 2.1(b) shows the minimum number of acceptors required to implement crash (Byzantine) fault tolerant consensus for various values of c (b). Table 2.1(c) shows the minimum number of acceptors required to implement HFT consensus for various values of b and c. Note that the row where b = 0 corresponds to crash fault tolerance and the column where c = 0 corresponds to traditional Byzantine fault tolerance. Table 2.1(d) shows the minimum number of acceptors required to implement UpRight consensus for various values of u and r. Note that crash, Byzantine, and hybrid fault tolerance can all be expressed under the UpRight framework. The row where r = 0 is equivalent to configurations that are safe and live despite up to c = u crash failures (aka crash fault tolerance); the diagonal where u = r is equivalent to configurations that are safe and live despite up to b = u = r Byzantine failures (aka Byzantine fault tolerance); the upper right quadrant is equivalent to configurations that are safe and live despite up to b = r Byzantine and c = u − r crash failures (aka hybrid fault tolerance).

The end-to-end impact of adopting the UpRight language for fault tolerance is a reduction in the number of acceptors required when compared to BFT and HFT consensus. Intuitively, BFT solutions to consensus require more replicas (3b + 1) than UpRight solutions (2u + r + 1) because the BFT solutions count every failure against the budgets for both u and r, even if only one of the b total failures is expected to be a commission failure. Similarly, HFT solutions to consensus require more replicas (2c + 3b + 1) than UpRight solutions because the Byzantine portion of the equation (3b) increases both the u and the r portions of the replication requirements even though it is needed only because of the commission failures captured by r.
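These formulas are easy to check mechanically. The helper below simply evaluates the four expressions from Table 2.1(a) and reproduces the running example (four crash failures plus one commission failure); the class and method names are ours, introduced only for illustration.

    // Minimum acceptor counts for asynchronous consensus under each failure model
    // (Table 2.1(a)). Illustrative helper; names are not part of the UpRight library.
    final class AcceptorCounts {
        static int cft(int c)            { return 2 * c + 1; }          // crash
        static int bft(int b)            { return 3 * b + 1; }          // Byzantine
        static int hft(int b, int c)     { return 3 * b + 2 * c + 1; }  // hybrid
        static int upright(int u, int r) { return 2 * u + r + 1; }      // UpRight

        public static void main(String[] args) {
            // Running example: tolerate 4 crash failures plus 1 commission failure.
            System.out.println(cft(4));        // 9  = 2*4+1 (crash failures only)
            System.out.println(bft(5));        // 16 = 3*(4+1)+1
            System.out.println(hft(1, 4));     // 12 = 3*1+2*4+1
            System.out.println(upright(4, 1)); // 10 = 2*4+1*1+1
        }
    }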
(a) Replication requirements
    CFT: 2c + 1    BFT: 3b + 1    Hybrid: 2c + 3b + 1    UpRight: 2u + r + 1

(b) CFT and BFT replication
    f      0   1   2   3
    CFT    1   3   5   7
    BFT    1   4   7  10

(c) Hybrid replication (rows b, columns c)
    b\c    0   1   2   3
    0      1   3   5   7
    1      4   6   8  10
    2      7   9  11  13
    3     10  12  14  16

(d) UpRight replication (rows r, columns u)
    r\u    0   1   2   3
    0      1   3   5   7
    1      2   4   6   8
    2      3   5   7   9
    3      4   6   8  10

Table 2.1: (a) Acceptors required to solve asynchronous consensus under various failure models. c is the maximum number of crash failures and b is the maximum number of Byzantine failures tolerated while ensuring the system is both safe and live. u is the maximum number of failures tolerated while ensuring the system is up. r is the maximum number of commission failures tolerated while ensuring the system is right. (b) Acceptors required to solve asynchronous consensus under the crash (Byzantine) failure model for various values of f = b = c. (c) Acceptors required to solve asynchronous consensus under a hybrid failure model with varying values of b and c. (d) Acceptors required to solve asynchronous consensus under the UpRight model with varying values of u and r. Values representing equivalent configurations across tables are marked with emphasis (italicized for BFT configurations, bolded for CFT configurations, or underlined for HFT configurations).

2.3 Why UpRight?

The previous sections describe terminology for classifying and counting failures in fault tolerant systems that departs from the customary notions of crash and Byzantine fault tolerance. Although UpRight fault tolerance generalizes crash, Byzantine, and hybrid fault tolerance, it is tempting to view the discussion as a theoretical novelty. We believe that UpRight fault tolerance is more than a novelty and is in fact the right framework to use when designing fault tolerant systems. UpRight fault tolerance provides several advantages when compared to other fault tolerance frameworks:

• UpRight fault tolerance is flexible and allows system designers and administrators to configure systems with the minimum number of servers. Traditional fault tolerance constructs are limited: crash fault tolerant systems cannot tolerate commission failures that result from server malfunction, while Byzantine and hybrid fault tolerance can unnecessarily increase the replication requirements of the system.

• UpRight fault tolerance is a generalization of crash, Byzantine, and hybrid fault tolerance. If crash, Byzantine, or hybrid fault tolerance does accurately capture the design requirements of a specific environment, then those requirements can be efficiently expressed using the UpRight framework. Further, if those requirements change, then the changes can be accounted for within the UpRight framework by adjusting the values of u and r—there is no need to shift from a crash to a Byzantine or hybrid fault tolerant protocol configuration as the design goals and deployment requirements change. UpRight is the single framework for all of your fault tolerant needs.

There are two non-questions that are frequently asked whenever Byzantine fault tolerant systems are discussed. While these questions do not apply to the technical discussion of UpRight fault tolerance, it is important to address them.

Do Byzantine failures actually happen? This question is outside the scope of this chapter and this thesis. We claim that fault tolerant systems should be designed to be UpRight and that it is the responsibility of the administrators deploying the system to choose values of u and r that are appropriate for their deployment.
Put another way, rather than choosing between crash and Byzantine fault tolerance, system designers should choose UpRight fault tolerance and leave the decision of the type of failures to tolerate to the users of the system. Stronger claims about the frequency and impact of commission failures require extensive deployment, classification, and analysis of observed failures. We note that preventing transient crashes from becoming commission failures, as discussed in Section 2.1.2, is non-trivial since it can be difficult to determine when a disk write is really complete [75].

Who cares about fault tolerance when the failure models are wrong because correlated failures do happen? This question is misguided. The failure model describes and classifies the types of failures that can occur. Different fault tolerance criteria express different failure scenarios under which safety and/or liveness are desired. The different failure models discussed in this chapter provide a framework for discussing and designing systems: the system should be live despite up to u failures and safe despite up to r commission failures. In principle, there is no reason not to attempt to build systems that are safe and live despite u = r = n − 1 failures (where n is the total number of servers). Specific problems, i.e., specific definitions of safety and liveness, may require u and r to be smaller and/or fractions of the total number of servers. Solutions to the consensus problem referred to in this chapter, for example, require n ≥ 2u + r + 1 servers, introducing a failure threshold significantly smaller than the total number of servers in the system. The concern with correlated failures is not connected to the failure model, but rather to the specific techniques employed. The rest of this thesis presents a better way to reason about and design state machine replication; it does not demonstrate that state machine replication is the right approach for solving any specific deployment challenge.

Chapter 3

Robust Performance

Prelude

While the previous chapter focuses on the framework for discussing fault tolerant systems, i.e., how failures are classified and counted, this chapter focuses on what it means for a system to be fault tolerant. Although the discussion is presented in terms of asynchronous Byzantine fault tolerant state machine replication, the conclusion is generally applicable to any fault tolerant system.

3.1 Introduction

This chapter is motivated by a simple observation: although recently developed BFT state machine replication protocols have driven the costs of BFT replication to remarkably low levels [1, 18, 26, 49], the reality is that they don't tolerate Byzantine faults very well. In fact, a single faulty client or server can render these systems effectively unusable by inflicting multiple orders of magnitude reductions in throughput and even long periods of complete unavailability. Performance degradations of such magnitude are at odds with what one would expect from a system that calls itself Byzantine fault tolerant—after all, if a single fault can render a system unavailable, can that system truly be said to tolerate failures?

To illustrate the problem, Table 3.1 shows the measured performance of a variety of systems both in the absence of failures and when a single faulty client submits a carefully crafted series of requests. As we show later, a wide range of other behaviors—faulty primaries, recovering replicas, etc.—can have a similar impact.
System         Peak Throughput   Faulty Client
PBFT [18]      61.7k             0
Q/U [1]        23.8k             0†
HQ [26]        7.6k              N/A‡
Zyzzyva [49]   66k               0
Aardvark       38.7k             38.7k

Table 3.1: Observed peak throughput of BFT systems in a fault-free case and when a single faulty client submits a carefully crafted series of requests. We detail our measurements in Section 3.6.2. † The result reported for Q/U is for correct clients issuing conflicting requests. ‡ The HQ prototype demonstrates fault-free performance and does not implement many of the error-handling steps required to resolve inconsistent MACs.

We believe that these collapses are byproducts of a single-minded focus on designing BFT protocols with ever more impressive best-case performance. While this focus is understandable—after years in which BFT replication was dismissed as too expensive to be practical, it was important to demonstrate that high-performance BFT is not an oxymoron—it has led to protocols whose complexity undermines robustness in two ways: (1) the protocols' design includes fragile optimizations that allow a faulty client or server to knock the system off the optimized execution path to expensive alternative paths, and (2) the protocol implementations often fail to handle properly all of the intricate corner cases, so that the implementations are even more vulnerable than the protocols appear on paper.

The primary contribution of this chapter is to advocate a new approach, robust BFT (RBFT), to building BFT systems. Our goal is to change the way BFT systems are designed and implemented by shifting the focus from constructing high-strung systems that maximize best-case performance to constructing systems that offer good and predictable performance under the broadest possible set of circumstances—including when faults occur.

In Section 3.2 we elaborate on the need to rethink Byzantine fault tolerance and identify a set of design principles for RBFT systems. In Section 3.3 we present a systematic methodology for designing RBFT systems and an overview of the Aardvark RBFT prototype. In Section 3.4 we describe in detail the important components of the Aardvark protocol. In Section 3.5 we present an analysis of Aardvark's expected performance. In Section 3.6 we present our experimental evaluation.

3.2 Recasting the problem

The foundation of modern BFT state machine replication rests on an impossibility result and on two principles that assist us in dealing with it. The impossibility result, of course, is FLP [35], which states that no solution to consensus can be both safe and live in an asynchronous system if nodes can fail. The two principles, first applied by Lamport to his Paxos protocol [53], are at the core of Castro and Liskov's seminal work on PBFT [17]. The first states that synchrony must not be needed for safety: as long as a threshold of faulty servers is not exceeded, the replicated service must always produce linearizable executions, independent of whether the network loses, reorders, or arbitrarily delays messages. The second recognizes, given FLP, that synchrony must play a role in liveness: clients are guaranteed to receive replies to their requests only during intervals in which messages sent to correct nodes are received within some fixed (but potentially unknown) time interval from when they are sent.

Within these boundaries, the engineering of BFT protocols has embraced Lampson's well-known recommendation: "Handle normal and worst-case separately as a rule because the requirements for the two are quite different. The normal case must be fast.
The worst-case must make some progress" [62].

Ever since PBFT, the design of BFT systems has followed a predictable pattern: first, characterize what defines the normal (common) case; then, pull out all the stops to make the system perform well for that case. While different systems don't completely agree on what defines the common case [42], on one point they are unanimous: the common case includes only gracious executions, defined as follows:

Definition 2 (Gracious execution). An execution is gracious iff (a) the execution is synchronous with some implementation-dependent short bound on message delay and (b) all clients and servers behave correctly.

The results of this approach have been spectacular. In 2007, Zyzzyva reported throughput of over 85,000 null requests per second [49], and subsequent protocols have improved on that mark [42, 93].

Despite these impressive results, we argue that a single-minded focus on aggressively tuning BFT systems for the best case of gracious execution, a practice that we have engaged in with relish [49], is increasingly misguided, dangerous, and even futile.

It is misguided, because it encourages the design and implementation of systems that fail to deliver on their basic promise: to tolerate Byzantine faults. While providing impressive throughput during gracious executions, today's high-performance BFT systems are content to provide weak liveness guarantees (e.g., "eventual progress") in the presence of Byzantine failures. Unfortunately, as we previewed in Table 3.1 and show in detail in Section 3.6.2, these guarantees are weak indeed. Although current BFT systems can survive Byzantine faults without compromising safety, we contend that a system that can be made completely unavailable by a simple Byzantine failure can hardly be said to tolerate Byzantine faults.

It is dangerous, because it encourages fragile optimizations. Fragile optimizations are harmful in two ways. First, as we will see in Section 3.6.2, they make it easier for a faulty client or server to knock the system off its hard-won optimized execution path and enter an alternative, much more expensive one. Second, they weigh down the system with subtle corner cases, increasing the likelihood of buggy or incomplete implementations.

It is (increasingly) futile, because the race to optimize common-case performance has reached a point of diminishing returns where many services' peak demands are already far under the best-case throughput offered by existing BFT replication protocols. For such systems, good enough is good enough, and further improvements in best-case agreement throughput will have little effect on end-to-end system performance.

In our view, a BFT system will be most useful if it provides acceptable and dependable performance across the broadest possible set of executions, including executions with Byzantine clients and servers. In particular, the temptation of fragile optimizations should be resisted: a BFT system should be designed around an execution path that has three properties: (1) it provides acceptable performance, (2) it is easy to implement, and (3) it is robust against Byzantine attempts to push the system away from it. Optimizations for the common case should be accepted only as long as they don't endanger these properties.

FLP tells us that we cannot guarantee liveness in an asynchronous environment. This is no excuse to focus only on performance during gracious executions.
In particular, there is no theoretical reason why BFT systems should not be expected to perform well in what we call uncivil executions:

Definition 3 (Uncivil execution). An execution is uncivil iff (a) the execution is synchronous with some implementation-dependent bound on message delay, (b) up to f servers and any number of clients are Byzantine, and (c) all remaining clients and servers are correct.

Hence, we propose to build RBFT systems that provide adequate performance during uncivil executions. Although we recognize that this approach is likely to reduce the best-case performance, we believe that for a BFT system a limited reduction in peak throughput is usually preferable to the devastating loss of availability that we report in Table 3.1 and Section 3.6.2.

Increased robustness may come at effectively no additional cost as long as a service's peak demand is below the throughput achievable through RBFT design: as a data point, our Aardvark prototype reaches a peak throughput of 38.7k req/s. Similarly, when systems have other bottlenecks, Amdahl's law limits the impact of changing the performance of agreement. For example, we report in Section 3.6 that PBFT can execute almost 62,000 null requests per second, suggesting that agreement consumes 16.1µs per request. If, rather than a null service, we replicate a service for which executing an average request consumes 100µs of processing time, then peak throughput with PBFT settles to about 8613 requests per second. For the same service, a protocol with twice the agreement overhead of PBFT (i.e., 32.2µs per request) would still achieve peak throughput of about 7564 requests per second: in this hypothetical example, doubling agreement overhead would reduce peak end-to-end throughput by about 12%.

3.3 Aardvark: RBFT in action

Aardvark is a new BFT system designed and implemented to be robust to failures. The Aardvark protocol consists of three stages: client request transmission, replica agreement, and primary view change. This is the same basic structure as PBFT [18] and its direct descendants [7, 49, 50, 104, 107], but revisited with the goal of achieving an execution path that satisfies the properties outlined in the previous section: acceptable performance, ease of implementation, and robustness against Byzantine disruptions. To avoid the pitfalls of fragile optimizations, we focus at each stage of the protocol on how faulty nodes, by varying both the nature and the rate of their actions and omissions, can limit the ability of correct nodes to perform in a timely fashion what the protocol requires of them. This systematic methodology leads us to the three main design differences between Aardvark and previous BFT systems: (1) signed client requests, (2) resource isolation, and (3) regular view changes.

Signed client requests. Aardvark clients use digital signatures to authenticate their requests. Digital signatures provide non-repudiation and ensure that all correct replicas make identical decisions about the validity of each client request, eliminating a number of expensive and tricky corner cases found in existing protocols that make use of weaker (though faster) message authentication code (MAC) authenticators [17] to authenticate client requests. The difficulty with utilizing MAC authenticators is that they do not provide the non-repudiation property of digital signatures—one node validating a MAC authenticator does not guarantee that any other nodes will validate that same authenticator [3].
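The lack of non-repudiation is easy to see in a toy example. The Python sketch below constructs a MAC authenticator whose entry for the primary verifies while the entries for the other replicas do not; nothing in the authenticator lets the other replicas tell whether the client or the primary is at fault. The pairwise keys and the request encoding are made up for the example and do not reflect any real protocol's wire format.

    import hmac, hashlib

    def mac(key: bytes, msg: bytes) -> bytes:
        # One MAC per (sender, receiver) pair, keyed with their shared secret.
        return hmac.new(key, msg, hashlib.sha256).digest()

    # Hypothetical pairwise keys shared between client c and each replica.
    keys = {"replica0": b"k0", "replica1": b"k1", "replica2": b"k2", "replica3": b"k3"}
    request = b"REQUEST:op=transfer,seq=7,client=c"

    # A correct client computes every entry of the authenticator over the same request.
    honest_auth = {r: mac(k, request) for r, k in keys.items()}

    # A faulty client can make the entry for the primary (replica0) verify while the
    # entries for the other replicas are garbage: the primary accepts the request,
    # the other replicas reject it, and no single entry proves who misbehaved.
    faulty_auth = dict(honest_auth)
    for r in ("replica1", "replica2", "replica3"):
        faulty_auth[r] = b"\x00" * 32

    for r, k in keys.items():
        ok = hmac.compare_digest(faulty_auth[r], mac(k, request))
        print(r, "accepts" if ok else "rejects")   # only replica0 accepts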
As we mentioned in the introduction to this chapter, digital signatures are generally seen as too expensive to use. Aardvark uses them only for client requests, where it is possible to push the expensive act of generating the signature onto the client while leaving the servers with the less expensive verification operation. Server-initiated communication—primary-to-replica, replica-to-replica, and replica-to-client communication—relies on MAC authenticators. The quorum-driven nature of server-initiated communication ensures that f or fewer faulty replicas are unable to force the system into undesirable execution paths. Because of the additional costs associated with verifying signatures in place of MACs, Aardvark must guard against new denial-of-service attacks in which the system receives a large number of requests with signatures that need to be verified. Our implementation limits the number of signature verifications a client can inflict on the system by (1) utilizing a hybrid MAC-signature construct to put a hard limit on the number of faulty signature verifications a client can inflict on the system and (2) forcing a client to complete one request before issuing the next.

[Footnote 1: In developing the Aardvark prototype we explicitly assumed that clients are external entities that are not controlled by the service provider. In this context, the service provider is not responsible for costs incurred by the clients. In retrospect, this assumption is not appropriate for many deployments and impacts our design of the UpRight library in Chapters 4-6.]
[Footnote 2: Note that we target systems that are safe and live despite up to f = u = r faulty replicas in this chapter.]

[Figure 3.1: Physical network in Aardvark.]

Resource isolation. The Aardvark prototype implementation explicitly isolates network and computational resources. As illustrated by Fig. 3.1, Aardvark uses separate network interface controllers (NICs) and wires to connect each pair of replicas. This step prevents a faulty server from interfering with the timely delivery of messages from good servers, as happened when a single broken NIC shut down the immigration system at the Los Angeles International Airport [21]. It also allows a node to defend itself against brute-force denial-of-service attacks by disabling the offending NIC. However, using physically separate NICs for communication between each pair of servers incurs a performance cost, as Aardvark can no longer use Ethernet multicast to optimize all-to-all communication, and it limits the number of replicas in the system to the number of expansion slots on each machine.

As Figure 3.2 shows, Aardvark uses separate work queues for processing messages from clients and individual replicas. Employing a separate queue for client requests prevents client traffic from drowning out the replica-to-replica communication required for the system to make progress. Similarly, employing a separate queue for each replica allows Aardvark to schedule message processing fairly, ensuring that a replica is able to efficiently gather the quorums it needs to make progress. Aardvark can also easily leverage separate processors to process incoming client and replica requests. Taking advantage of hardware parallelism allows Aardvark to reclaim part of the costs paid to verify signatures on client requests. We use simple brute-force techniques for resource scheduling.
One could consider network-level scheduling techniques rather than distinct NICs in order to isolate network traffic and/or allow rate-limited multicast. Our goal is to make Aardvark as simple as possible, so we leave exploration of these techniques and optimizations for future work.

Regular view changes. To prevent a primary from achieving tenure and exerting absolute control on system throughput, Aardvark invokes the view change operation on a regular basis. Replicas monitor the performance of the current primary, slowly raising the required throughput level. If the current primary fails to provide the required throughput, replicas initiate a view change. The key properties of this technique are:

1. During uncivil intervals, system throughput remains high even when replicas are faulty. Since a primary maintains its position only if it achieves some increasing level of throughput, Aardvark bounds throughput degradation caused by a faulty primary by either forcing the primary to be fast or selecting a new primary. When a new primary is selected, the required throughput is reset to an initial threshold, e.g., one half of the previous requirement.

2. As in prior systems, eventual progress is guaranteed when the system is eventually synchronous.

[Figure 3.2: Architecture of a single replica. The replica utilizes a separate NIC for communicating with each other replica and a final NIC to communicate with the collection of clients. Messages from each NIC are placed on separate worker queues.]

Previous systems have treated view change as an option of last resort that should only be used in desperate situations to avoid letting throughput drop to zero. However, although the phrase "view change" carries connotations of a complex and expensive protocol, in reality the cost of a view change is similar to the regular cost of agreement. Performing view changes regularly introduces short periods of time during which new requests are not being processed, but the benefits of rapidly evicting a misbehaving primary outweigh the periodic costs associated with performing view changes.

3.4 Protocol description

Figure 3.3 shows the agreement phase communication pattern that Aardvark shares with PBFT [18]. Variants of this pattern are employed in other recent BFT RSM protocols [1, 26, 42, 49, 93, 104, 107], and we believe that, just as Aardvark illustrates how the RBFT design approach can be applied to PBFT, new RBFT systems based on these other protocols can and should be constructed. We organize the following discussion around the numbered steps of the communication pattern of Figure 3.3.

[Figure 3.3: Basic communication pattern in Aardvark.]

3.4.1 Client request transmission

The fundamental challenge in transmitting client requests is ensuring that, upon receiving a client request, every replica comes to the same conclusion about the authenticity of the request. We ensure this property by having clients sign requests.

To guard against denial of service, we break the processing of a client request into a sequence of increasingly expensive steps. Each step serves as a filter, so that more expensive steps are performed less often. For instance, we ask clients to include a MAC on their signed requests and have replicas verify only the signature of those requests whose MAC checks out.
As mentioned in Section 3.3, Aardvark explicitly dedicates a single NIC to handling incoming client requests so that incoming client traffic does not interfere with replica-to-replica communication.

Protocol Description

The steps taken by an Aardvark replica to authenticate a client request follow.

1. Client sends a request to a replica.

A client c requests an operation o be performed by the replicated state machine by sending a request message ⟨⟨REQUEST, o, s, c⟩σc, c⟩µc,p to the replica p it believes to be the primary. If the client does not receive a timely response to that request, then the client retransmits the request ⟨⟨REQUEST, o, s, c⟩σc, c⟩µc,r to all replicas r. Note that the request contains the client sequence number s and is signed with signature σc. The signed message is then authenticated with a MAC µc,r for the intended recipient. The MAC ensures that the signature cannot be corrupted by an intermediary.

Upon receiving a client request, a replica proceeds to verify it by following a sequence of steps designed to limit the maximum load a client can place on a server, as illustrated by Figure 3.4:

(a) Blacklist check. If the sender c is not blacklisted, then proceed to step (b). Otherwise discard the message.

(b) MAC check. If µc,p is valid, then proceed to step (c). Otherwise discard the message.

(c) Sequence check. Compare the sequence number scache of the most recently cached reply for client c to the sequence number s of the incoming request. If the request sequence number s is exactly scache + 1, then proceed to step (d). Otherwise:

(c1) Retransmission check. Each replica uses an exponential back-off to limit the rate of client reply retransmissions. If a reply has not been sent to c recently, then retransmit the last reply sent to c. Otherwise discard the message.

(d) Redundancy check. Examine the most recently cached request from c. If no request from c with sequence number s has previously been verified, or the incoming request does not match the cached request, then proceed to step (e). Otherwise (the request matches the cached request from c) proceed to step (f).

(e) Signature check. If σc is valid and the request does not match the previously cached request for sequence number s, then blacklist c and discard the message. If σc is valid and there is no such mismatch, then proceed to step (f). Otherwise, if σc is not valid, then blacklist the node x that authenticated µx,p and discard the message.

(f) Once-per-view check. If an identical request has been verified in a previous view, but not processed during the current view, then act on the request. Otherwise discard the message.

[Figure 3.4: Decision tree followed by replicas while verifying a client request. The narrowing width of the edges reflects the relative volume of client requests that survive each step of the verification process.]

Primary and non-primary replicas act on requests in different ways. A primary adds requests to a pre-prepare message that is part of the three-phase commit protocol described in Section 3.4.2. A non-primary replica r processes a request by authenticating the signed request with a MAC µr,p for the primary p and sending the message to the primary.
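The ordering of checks (a)-(f) can be summarized in code. The Python sketch below assumes a hypothetical replica object exposing the blacklist, the per-client reply and request caches, and the MAC and signature checks; it is intended only to show how the cheap checks shield the expensive ones, not to reproduce Aardvark's implementation.

    def verify_client_request(replica, req):
        """Apply checks (a)-(f) from cheapest to most expensive. `replica` is a
        hypothetical object bundling the blacklist, caches, and crypto helpers."""
        c = req.client_id
        if c in replica.blacklist:                              # (a) blacklist check
            return "discard"
        if not replica.mac_valid(req):                          # (b) MAC check
            return "discard"
        if req.seq != replica.cached_reply_seq(c) + 1:          # (c) sequence check
            if replica.retransmit_allowed(c):                   # (c1) exponential back-off
                replica.send_cached_reply(c)                    # help the client resynchronize
            return "discard"
        cached = replica.cached_request(c, req.seq)             # (d) redundancy check
        if cached is None or cached != req.body:
            if not replica.signature_valid(req):                # (e) signature check
                replica.blacklist.add(req.mac_author)           # bad signature under a valid MAC
                return "discard"
            if cached is not None and cached != req.body:       # same seq, two signed requests
                replica.blacklist.add(c)
                return "discard"
        if not replica.first_time_in_view(req):                 # (f) once-per-view check
            return "discard"
        return "act_on_request"                                 # primary: batch into a pre-prepare;
                                                                # others: forward to the primary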
Note that non-primary replicas will forward each request at most once per view, but they may forward a request multiple times provided that a view change occurs between each occurrence.

Note that a request message that is verified as authentic might contain an operation that the replicated service running above Aardvark rejects because of an access control list (ACL) or other service-specific security violation. From the point of view of Aardvark, such messages are valid and are delivered to all replicas in the same order. It is the responsibility of the replicated service to handle such messages and security violations, either by rejecting the operation at the service level or by generating an application-level error code.

A node p only blacklists a sender c of a ⟨⟨REQUEST, o, s, c⟩σc, c⟩µc,p message if (a) the MAC µc,p is valid but the signature σc is not or (b) the client applies the same sequence number to two distinct requests. A valid MAC is sufficient to ensure that routine message corruption is not the cause of the altered request or invalid signature sent by c, but rather that c has suffered a significant fault or is engaging in malicious behavior. A replica discards all messages it receives from a blacklisted sender and removes the sender from the blacklist after 10 minutes to allow reintegration of repaired machines.

Resource scheduling

Client requests are necessary to provide input to the RSM, while replica-to-replica communication is necessary to process those requests. Aardvark implements separate work queues for receiving client requests and receiving replica-to-replica communication to limit the fraction of replica resources that clients are able to consume, ensuring that a flood of client requests is unable to prevent replicas from making progress on requests already received. Of course, as in a non-BFT service, malicious clients can still deny service to other clients by flooding the network between clients and replicas. Defending against these attacks is an area of active independent research [63, 101].

We deploy our prototype implementation on dual-core machines. As Figure 3.2 shows, one core verifies client requests and the second runs the replica protocol. This explicit assignment allows us to isolate resources and take advantage of parallelism to partially mask the additional costs of signature verification.

Discussion

RBFT aims at minimizing the costs that faulty clients can impose on replicas. As Figure 3.4 shows, there are four actions triggered by the transmission of a client request that can consume significant replica resources: MAC verification (MAC check), retransmission of a cached reply, signature verification (signature check), and request processing (act on request). The cost a faulty client can cause increases as the request passes each successive check in the verification process, but the rate at which a faulty client can trigger this cost decreases at each step.

Starting from the final step of the decision tree, the design ensures that the most expensive message a client can send is a correct request as specified by the protocol, and it limits the rate at which a faulty client can trigger expensive signature checks and request processing to the maximum rate a correct client would. The sequence check step (c) ensures that a client can trigger signature verification or request processing for a new sequence number only after its previous request has been successfully executed.
The redundancy check (d) prevents repeated signature verifications for the same sequence number by caching each client's most recent request. Finally, the once-per-view check (f) permits repeated processing of a request only across different views to ensure progress. The signature check (e) ensures that only requests that will be accepted by all correct replicas are processed. The net result of this filtering is that, for every k correct requests submitted by a client, each replica performs at most k + 1 signature verifications, and any client that imposes a (k + 1)st signature verification is blacklisted and unable to instigate additional signature verifications until it is removed from the blacklist.

Moving up the diagram, a replica responds to retransmission of completed requests paired with valid MACs by retransmitting the most recent reply sent to that client. The retransmission check (c1) imposes an exponential back-off on retransmissions, limiting the rate at which clients can force the replica to retransmit a response. To help a client learn the sequence number it should use, a replica resends the cached reply at this limited rate upon receipt of requests that are from the past or very far in the future.

Any request that fails the MAC check (b) is immediately discarded. MAC verifications occur on every incoming message that claims to have the right format unless the sender is blacklisted, in which case the blacklist check (a) results in the message being discarded. The rate of MAC verification operations is thus limited by the rate at which messages purportedly from non-blacklisted clients are pulled off the network, and the fraction of processing wasted is at most the fraction of incoming requests from faulty clients.

3.4.2 Replica agreement

Once a request has been transmitted from the client to the current primary, the replicas must agree on the request's position in the global order of operations. Aardvark replicas coordinate with each other using a standard three-phase-commit protocol [18].

The fundamental challenge in the agreement phase is ensuring that each replica can quickly collect the quorums of prepare and commit messages necessary to make progress. Conditioning expensive operations on the gathering of a quorum of messages makes it easier to ensure robustness in two ways. First, it is possible to design the protocol so that incorrect messages sent by a faulty replica will never gain the support of a quorum of replicas. Second, as long as there exists a quorum of timely correct replicas, a faulty replica that sends correct messages too slowly, or not at all, cannot impede progress. Faulty replicas can also introduce overhead by sending useless messages or by sending messages too quickly: to protect themselves, correct replicas in Aardvark process messages from other replicas in a round-robin fashion whenever messages from multiple replicas are available.

Not all expensive operations in Aardvark are triggered by a quorum. In particular, a correct replica that has fallen behind its peers may ask them for the state it is missing by sending them a catchup message (see Section 3.4.2). Aardvark replicas defer processing such messages to idle periods. Note that this state-transfer procedure is self-tuning: if the system is unable to make progress because it cannot assemble quorums of prepare and commit messages, then it will become idle and devote more time to processing catchup messages.

Agreement protocol

The agreement protocol requires replica-to-replica communication.
A replica r filters, classifies, and finally acts on the messages it receives from another replica according to the decision tree shown in Figure 3.5:

(a) Volume check. If replica q is sending too many messages, blacklist q and discard the message. Otherwise continue to step (b). Aardvark replicas use a distinct NIC for communicating with each replica. Using per-replica NICs allows an Aardvark replica to silence replicas that flood the network and impose excessive interrupt processing load. In our prototype, we disable a network connection when q's rate of message transmission in the current view is a factor of 20 higher than for any other replica. After disconnecting q for flooding, r reconnects q after 10 minutes, or when f other replicas are disconnected for flooding.

(b) Round-robin scheduler. Among the pending messages, select the next message to process from the available messages in round-robin order based on the sending replica ID. Discard received messages when the buffers are full.

(c) MAC check. If the selected message has a valid MAC, then proceed to step (d); otherwise, discard the message.

(d) Classify message. Classify the authenticated message according to its type:

• If the message is a pre-prepare, then process it immediately in protocol step 3 below.

• If the message is a prepare or commit, then add it to the appropriate quorum and proceed to step (e).

• If the message is a catchup message, then proceed to step (f).

• If the message is anything else, then discard the message.

(e) Quorum check. If the quorum to which the message was added is complete, then act as appropriate in protocol steps 4-6 below.

(f) Idle check. If the system has free cycles, then process the catchup message. Otherwise, defer processing until the system is idle.

[Figure 3.5: Decision tree followed by a replica when handling messages received from another replica. The width of the edges indicates the rate at which messages reach various stages in the processing.]

Replica r applies the above steps to each message it receives from the network. Once messages are appropriately filtered and classified, the agreement protocol continues from step 2 of the communication pattern in Figure 3.3.

2. Primary forms a pre-prepare message containing a set of valid requests and sends the pre-prepare to all replicas.

The primary creates and transmits a ⟨PRE-PREPARE, v, n, ⟨REQUEST, o, s, c⟩σc⟩µ⃗p message where v is the current view number, n is the sequence number for the pre-prepare, and the authenticator is valid for all replicas. Although we show a single request as part of the pre-prepare message, multiple requests can be batched in a single pre-prepare [18, 37, 49, 50].

3. Replica receives pre-prepare from the primary, authenticates the pre-prepare, and sends a prepare to all other replicas.

Upon receipt of ⟨PRE-PREPARE, v, n, ⟨REQUEST, o, s, c⟩σc⟩µ⃗p from primary p, replica r verifies the message's authenticity following a process similar to the one described in Section 3.4.1 for verifying requests. If r has already accepted the pre-prepare message, r discards the message preemptively.
If r has already processed a different pre-prepare message with n′ = n during view v, then r discards the message. If r has not yet processed a pre-prepare message for n during view v, r first checks that the appropriate portion of the MAC authenticator µ⃗p is valid. If the replica has not already done so, it then checks the validity of σc. If the authenticator is not valid, r discards the message. If the authenticator is valid and the client signature is invalid, then the replica blacklists the primary and requests a view change. If, on the other hand, the authenticator and signature are both valid, then the replica logs the pre-prepare message and forms a ⟨PREPARE, v, n, h, r⟩µ⃗r to be sent to all other replicas, where h is the digest of the set of requests contained in the pre-prepare message.

4. Replica receives 2f prepare messages that are consistent with the pre-prepare message for sequence number n and sends a commit message to all other replicas.

Following receipt of 2f matching prepare messages from non-primary replicas r′ that are consistent with a pre-prepare from primary p, replica r sends a ⟨COMMIT, v, n, r⟩µ⃗r message to all replicas. Note that the pre-prepare message from the primary is the (2f + 1)st message in the prepare quorum.

5. Replica receives 2f + 1 commit messages, commits and executes the request, and sends a reply message to the client.

After receipt of 2f + 1 matching ⟨COMMIT, v, n, r′⟩µ⃗r′ messages from distinct replicas r′, replica r commits and executes the request before sending ⟨REPLY, v, u, r⟩µr,c to client c, where u is the result of executing the request and v is the current view.

6. The client receives f + 1 matching reply messages and accepts the request as complete.

We also support Castro's tentative execution optimization [18]. The details of tentative execution do not impact the RBFT design and analysis.

Catchup messages. State catchup messages are not an intrinsic part of the agreement protocol, but they fulfill the important logistical priority of bringing replicas that have fallen behind back up to speed. If replica r receives a catchup message from a replica q that has fallen behind, then r sends q the state that q requires to catch up and resume normal operations. Sending catchup messages is vital to allow temporarily slow replicas to avoid becoming permanently non-responsive, but it also offers faulty replicas the chance to impose significant load on their non-faulty counterparts. Aardvark explicitly delays the processing of catchup messages until there are idle cycles available at a replica—as long as the system is making progress, processing a high volume of requests, there is no need to spend time bringing a slow replica up to speed!

Discussion

We now discuss the Aardvark agreement protocol through the lens of RBFT, starting from the bottom of Figure 3.5. Because every quorum contains at least a majority of correct replicas, faulty replicas can only marginally alter the rate at which correct replicas take actions (e) that require a quorum of messages. Further, because a correct replica processes catchup messages (f) only when otherwise idle, faulty replicas cannot use catchup messages to interfere with the processing of other messages. When client requests are pending, catchup messages are processed only if too many correct replicas have fallen behind and the processing of quorum messages needed for agreement has stalled—and only until enough correct replicas to enable progress have caught up.
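The scheduling and classification steps (a)-(f) of Figure 3.5 can also be summarized in a few lines of code. The Python sketch below is a self-contained illustration (queue sizes, message encoding, and method names are invented for the example): each peer gets its own bounded queue, messages are drained in round-robin order, and catchup messages are deferred unless the replica is idle.

    from collections import deque
    from itertools import cycle

    class ReplicaScheduler:
        """Minimal sketch: one bounded queue per peer replica, drained round-robin."""

        def __init__(self, peer_ids, max_queue=1024):
            self.queues = {p: deque(maxlen=max_queue) for p in peer_ids}  # bounded buffers
            self.order = cycle(peer_ids)
            self.deferred_catchup = deque()

        def enqueue(self, sender, msg):
            # (a) the volume check happens before this point: a flooding sender's
            # connection is disabled, so its messages never reach the queue.
            self.queues[sender].append(msg)

        def next_message(self):
            # (b) round-robin over peers so one replica cannot starve the others.
            for _ in range(len(self.queues)):
                sender = next(self.order)
                if self.queues[sender]:
                    return sender, self.queues[sender].popleft()
            return None

        def handle(self, sender, msg, mac_ok, idle):
            if not mac_ok:                         # (c) MAC check
                return "discard"
            kind = msg.get("type")                 # (d) classify the message
            if kind == "pre-prepare":
                return "act_on_pre_prepare"
            if kind in ("prepare", "commit"):
                return "add_to_quorum"             # (e) act once the quorum is complete
            if kind == "catchup":
                if idle:                           # (f) idle check
                    return "process_catchup"
                self.deferred_catchup.append((sender, msg))
                return "deferred"
            return "discard"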
Also note that the queue of pending catchup messages is finite, and a replica discards excess catchup messages. If the number of discarded messages exceeds a fixed maximum, then the replica clears the queue of pending catchup messages and resets the discarded message count.

A replica processes pre-prepare messages at the rate they are sent by the primary. If a faulty primary sends them too slowly or too quickly, throughput may be reduced, hastening the transition to a new primary as described in Section 3.4.3.

Finally, a faulty replica could simply bombard its correct peers with a high volume of messages that are eventually discarded. The round-robin scheduler (b) limits the damage that can result from this attack: if c of its peers have pending messages, then a correct replica wastes at most 1/c of the cycles spent checking MACs and classifying messages on what it receives from any faulty replica. The round-robin scheduler also discards messages that overflow a bounded buffer, and the volume check (a) similarly limits the rate at which a faulty replica can inject messages that the round-robin scheduler will eventually discard.

3.4.3 Primary view changes

Employing a primary to order requests enables batching [18, 37] and avoids the need to trust clients to obey a back-off protocol [1, 22]. However, because the primary is responsible for selecting which requests to execute, the system throughput is at most the throughput of the primary. The primary is thus in a unique position to control both overall system progress [4, 7] and fairness to individual clients.

The fundamental challenge to safeguarding performance against a faulty primary is that a wide range of primary behaviors can hurt performance. For example, the primary can delay processing requests, discard requests, corrupt clients' MAC authenticators, introduce gaps in the sequence-number space, unfairly delay or drop some clients' requests but not others, etc. Hence, rather than designing specific mechanisms to defend against each of these threats, past BFT systems [18, 49] have relied on view changes to replace an unsatisfactory primary with a new, hopefully better, one. Past systems trigger view changes conservatively, only changing views when it becomes apparent that the current primary is unlikely to allow the system to make even minimal progress.

Aardvark includes the same view change mechanism and triggers described for PBFT [18]; in conjunction with the agreement protocol, view changes in PBFT are sufficient to ensure eventual progress. They are not, however, sufficient to ensure acceptable progress, so Aardvark adds additional adaptive throughput triggers that can cause a view change when the current throughput is determined to be insufficient.

Adaptive throughput

Replicas monitor the throughput of the current primary. If a replica judges the primary's performance to be insufficient, then the replica initiates a view change. More specifically, replicas in Aardvark expect two things from the primary: a regular supply of pre-prepare messages and high sustained throughput. Following the completion of a view change, each replica starts a heartbeat timer that is reset whenever the next valid pre-prepare message is received. If a replica does not receive the next valid pre-prepare message before the heartbeat timer expires, the replica initiates a view change. To ensure eventual progress, a correct replica doubles the heartbeat interval each time the timer expires.
When a valid pre-prepare message is received, the replica resets the heartbeat timer back to its initial value. The value of the heartbeat timer is application and environment specific: our implementation uses a heartbeat of 40ms, so that a system that tolerates f failures demands a minimum of 1 pre-prepare every 2f × 40ms during uncivil intervals.

The periodic checkpoints that, at pre-determined intervals, correct replicas must take to bound their state offer convenient synchronization points to assess the throughput that the primary is able to deliver. If the observed throughput in the interval between two successive checkpoints falls below a specified threshold, initially 90% of the maximum throughput observed during the previous n views, the replica initiates a view change to replace the current primary. At each checkpoint interval following an initial grace period at the beginning of each view, 5s in our prototype, the required throughput is increased by a factor of 0.01 (i.e., raised by 1%). Continually raising the bar that the current primary must reach in order to stay in power guarantees that a view change will eventually occur and replace the primary, restarting the process with the next primary. Conversely, if the system workload changes, the required throughput adjusts over n views to reflect the performance that a correct primary can provide.

Note that every replica decides to initiate a view change independently, so some correct replicas may initiate a view change while others do not. As long as the remaining replicas are satisfied with the current throughput, they can continue processing messages in the current view even though some replicas have stopped processing requests in their desire to join the next view.

The combined effect of Aardvark's new expectations on the primary is that during the first 5s of a view the primary is required to provide throughput of at least 1 request per 40ms or face eviction. The throughput of any view that lasts longer than 5s is at least 90% of the maximum throughput observed during the previous n views.

Fairness

In addition to hurting overall system throughput, primary replicas can influence which requests are processed. A faulty primary could be unfair to a specific client (or set of clients) by neglecting to order requests from that client. To limit the magnitude of this threat, replicas track the fairness of request ordering. When a replica receives from a client a request that it has not seen in a pre-prepare message, it adds the message to its request queue and, before forwarding the request to the primary, it records the sequence number k of the most recent pre-prepare received during the current view. The replica monitors future pre-prepare messages for that request, and if it receives two pre-prepares for requests from other clients before receiving a pre-prepare containing that client's request, then it declares the current primary to be unfair and initiates a view change. This ensures that two clients issuing comparable workloads observe throughput values within a constant factor of each other.
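The two primary-monitoring triggers described in this subsection, the heartbeat timer and the per-checkpoint throughput threshold, can be sketched as follows. The constants (40ms heartbeat, 5s grace period, 90% initial threshold, 1% increments) come from the text above; the class structure and method names are hypothetical, the 1%-per-checkpoint ratchet is one reading of "increased by a factor of 0.01," and the sketch omits the fairness trigger.

    import time

    class PrimaryMonitor:
        """Sketch of Aardvark's adaptive view-change triggers (values from the text)."""

        def __init__(self, recent_peak_throughput):
            self.heartbeat = 0.040                          # 40ms initial heartbeat interval
            self.deadline = time.monotonic() + self.heartbeat
            self.view_start = time.monotonic()
            self.grace_period = 5.0                         # 5s grace period per view
            self.required = 0.9 * recent_peak_throughput    # 90% of peak over the last n views

        def on_pre_prepare(self):
            # A valid pre-prepare resets the heartbeat timer to its initial value.
            self.heartbeat = 0.040
            self.deadline = time.monotonic() + self.heartbeat

        def heartbeat_expired(self):
            if time.monotonic() < self.deadline:
                return False
            # To preserve eventual progress, double the interval each time it expires.
            self.heartbeat *= 2
            self.deadline = time.monotonic() + self.heartbeat
            return True                                     # caller initiates a view change

        def on_checkpoint(self, observed_throughput):
            # Called at each checkpoint interval; returns True if a view change is warranted.
            if time.monotonic() - self.view_start < self.grace_period:
                return False
            if observed_throughput < self.required:
                return True
            self.required *= 1.01                           # keep raising the bar by 1%
            return False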
Discussion

The adaptive view change and pre-prepare heartbeats leave a faulty primary with two options: it can provide substandard service and be replaced promptly, or it can remain the primary for an extended period of time and provide service comparable to what a non-faulty primary would provide. A faulty primary that does not make any progress will be caught very quickly by the heartbeat timer and summarily replaced. To avoid being replaced, a faulty primary must issue a steady stream of pre-prepare messages until it reaches a checkpoint interval, at which point it will be replaced unless it has provided the required throughput. Even a faulty primary that does just enough to stay ahead of its reckoning for as long as possible is thus forced to deliver 95% of the throughput expected from a correct primary.

Periodic view changes may appear to institutionalize overhead, but their cost is actually relatively small. Although the term view change evokes images of substantial restructuring, in reality a view change costs roughly as much as a single instance of agreement with respect to message/protocol complexity: when performed every 100+ requests, periodic view changes have marginal performance impact during gracious or uncivil intervals. We quantify these overheads experimentally in Section 3.6.

3.4.4 Implementation

The Aardvark prototype is based on the December 2007 release of the PBFT code base [9]. Our implementation of Aardvark consists of selected modifications to implement the signed client requests, periodic view changes, and resource scheduling discussed above. We rely on the same basic data structures as PBFT and the UDP-based network communication provided by PBFT [9].

3.5 Analysis

In this section, we analyze the throughput characteristics of Aardvark when the number of client requests is large enough to saturate the system and a fraction g of those requests is correct. We show that Aardvark's throughput during long enough uncivil executions is within a constant factor of its throughput during gracious executions of the same length, provided there are sufficient correct clients to saturate the servers.

For simplicity, we restrict our attention to a simplified Aardvark implementation on a single-core machine with a processor speed of κ GHz. We consider only the computational costs of the cryptographic operations—verifying signatures, generating MACs, and verifying MACs, requiring θ, α, and α cycles, respectively. Since these operations occur only when a message is sent or received, and the cost of sending or receiving messages is small, we expect similar results when modeling network costs explicitly.

[Footnote 3: Note that generating and verifying MACs are symmetric operations and have identical cost α.]

We begin in Theorem 1 by computing tpeak, simplified Aardvark's peak throughput during a gracious view, i.e., a view that occurs during a gracious execution.
We then show in Theorem 2 that during uncivil views in which the primary replica is correct, simplified Aardvark's peak throughput is only reduced to g × tpeak: in other words, ignoring low-level network overheads, faulty replicas are unable to curtail simplified Aardvark's throughput when the primary is correct. Finally, we show in Theorem 3 that the throughput across all views of an uncivil execution is within a constant factor of ((n − f)/n) × g × tpeak. Note that the latter two theorems are applicable only when the workload remains constant across multiple views: the Aardvark adaptive view change mechanism can require up to n views to converge on the appropriate required throughput for a given workload.

Theorem 1. Consider a gracious view during which the system is saturated, all requests come from correct clients, and the primary generates batches of requests of size b. Simplified Aardvark's throughput is then at least κ / (θ + ((4n + 2b − 4)/b) α) operations per second.

Proof. We examine the actions required by each server to process one batch of size b. For each request in the batch, every server verifies one signature. The primary also verifies one MAC per request. For each batch, the primary generates n − 1 MACs to send the pre-prepare and verifies n − 1 MACs upon receipt of the prepare messages; replicas instead verify one MAC in the primary's pre-prepare, generate n − 1 MACs when they send the prepare messages, and verify n − 2 MACs when they receive them. Finally, each server first sends and then receives n − 1 commit messages, for which it generates and verifies a total of 2n − 2 MACs, and generates a final MAC for each request in the batch to authenticate the response to the client. The total computational load per request is thus θ + ((4n + 2b − 4)/b) α at the primary, and θ + ((4n + b − 4)/b) α at a replica. The system's throughput at saturation during a sufficiently long view in a gracious interval is thus at least κ / (θ + ((4n + 2b − 4)/b) α) requests per second.

Theorem 2. Consider an uncivil view in which the primary is correct and at most f replicas are Byzantine. Suppose the system is saturated, but only a fraction g of the requests received by the primary are correct. The throughput of simplified Aardvark in this uncivil view is within a constant factor of its throughput in a gracious view in which the primary uses the same batch size.

Proof. Let θ and α denote the cost of verifying, respectively, a signature and a MAC. We show that if g is the fraction of correct requests, then the throughput during uncivil views with a correct primary approaches g times the gracious view's throughput as the ratio α/θ tends to 0. In an uncivil view, faulty clients may send unfaithful requests to every server. Before being able to form a batch of b correct requests, the primary may have to verify b/g signatures and b/g MACs, and correct replicas may verify b/g signatures and an additional (b/g)(1 − g) MACs. Because a correct server processes messages from other servers in round-robin order, it will process at most two messages from a faulty server per message that it would have processed had the server been correct. The total computational load per request is thus (1/g)(θ + ((b(1 + g) + 4g(n − 1 + f))/b) α) at the primary, and (1/g)(θ + ((b + 4g(n − 1 + f))/b) α) at a replica. The system's throughput at saturation during a sufficiently long view in an uncivil interval with a correct primary is thus at least gκ / (θ + ((b(1 + g) + 4g(n − 1 + f))/b) α) requests per second: as the ratio α/θ tends to 0, the ratio between the uncivil and gracious throughput approaches g.
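The bounds in Theorems 1 and 2 are easy to evaluate numerically. The Python sketch below simply encodes the two formulas; the parameter values are illustrative stand-ins (they are not the measurements reported in this chapter) and are chosen only to show that the uncivil-to-gracious ratio approaches g when α is small relative to θ.

    def gracious_throughput(kappa, theta, alpha, n, b):
        # Theorem 1: per-request load at the primary is theta + ((4n + 2b - 4)/b) * alpha.
        return kappa / (theta + ((4 * n + 2 * b - 4) / b) * alpha)

    def uncivil_throughput(kappa, theta, alpha, n, b, f, g):
        # Theorem 2: correct primary, fraction g of correct requests.
        return g * kappa / (theta + ((b * (1 + g) + 4 * g * (n - 1 + f)) / b) * alpha)

    # Illustrative (made-up) numbers: a 3 GHz core, signatures ~100x the cost of MACs.
    kappa, theta, alpha = 3e9, 500_000, 5_000
    n, b, f, g = 4, 10, 1, 0.9
    t_grac = gracious_throughput(kappa, theta, alpha, n, b)
    t_unc = uncivil_throughput(kappa, theta, alpha, n, b, f, g)
    print(f"gracious: {t_grac:,.0f} req/s, uncivil (correct primary): {t_unc:,.0f} req/s")
    print(f"ratio: {t_unc / t_grac:.2f} (approaches g = {g} as alpha/theta -> 0)")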
Theorem 3. For sufficiently long uncivil executions and for small f, the throughput of simplified Aardvark, when properly configured, is within a constant factor of its throughput in a gracious execution in which primary replicas use the same batch size and the system is saturated.

Proof. First consider the case in which all the uncivil views have correct primary replicas. Assume that in a properly configured simplified Aardvark, tbaseViewTimeout is set so that during an uncivil interval, a view change to a correct primary completes within tbaseViewTimeout. Since a primary's view lasts at least tgracePeriod, as the ratio α/θ tends to 0, the ratio between the throughput during an uncivil interval and during a gracious view approaches g × tgracePeriod / (tbaseViewTimeout + tgracePeriod).

Now consider the general case. If the uncivil interval is long enough, at most f/n of its views will have a Byzantine primary. Simplified Aardvark's heartbeat timer provides two guarantees. First, a Byzantine server that does not produce the throughput that is expected of a correct server will not last as primary for longer than a grace period. Second, a correct server is always retained as a primary for at least the length of a grace period. Furthermore, since the throughput expected of a primary at the beginning of a view is a constant fraction of the maximum throughput achieved by the primary replicas of the last n views, faulty primary replicas cannot arbitrarily lower the throughput expected of a new primary. Finally, since the view change timeout is reset after a view change that results in at least one request being executed in the new view, no view change attempt takes longer than tmaxViewTimeout = 2f × tbaseViewTimeout. It follows that, during a sufficiently long uncivil interval, the throughput will be within a factor of (tgracePeriod / (tmaxViewTimeout + tgracePeriod)) × ((n − f)/n) of that of Theorem 2 and, as α/θ tends to 0, the ratio between the throughput during uncivil and gracious intervals approaches g × (tgracePeriod / (tmaxViewTimeout + tgracePeriod)) × ((n − f)/n).

3.6 Experimental evaluation

In evaluating Aardvark we compare the throughput and latency provided by Aardvark and previous replication protocol prototypes (PBFT [18], HQ [26], Q/U [1], and Zyzzyva [49]) during failure-free and failure-ful executions. Our evaluation demonstrates three points: (a) despite our choice to utilize signatures, change views regularly, and forsake IP multicast, Aardvark's peak throughput is competitive with that of existing systems; (b) existing systems are vulnerable to significant disruption as a result of a broad range of Byzantine behaviors; and (c) Aardvark is robust to a wide range of Byzantine behaviors. When evaluating existing systems, we attempt to identify places where the prototype implementation departs from the published protocol.

Environment. We evaluate the performance of Aardvark, PBFT [18, 9], HQ [26], Q/U [1, 83], and Zyzzyva [49] on an Emulab cluster [103] deployed at the University of Texas at Austin. This cluster consists of machines with 3GHz Intel Pentium 4 Xeon processors with hyperthreading, 1GB of memory, and 1 Gb/s Ethernet connections. The code bases used to report our results are provided by the respective systems' authors. James Cowling provided us with the December 2007 public release of the PBFT code base [9] as well as a copy of the HQ codebase. We used version 1.3 of the Q/U codebase, provided to us by Michael Abd-El-Malek in October 2008 [83]. The Zyzzyva codebase is the version used in the SOSP 2007 paper [49].
The Aardvark code is the version used for the NSDI 2009 paper [24]. We rely on the existing configurations for each system to handle f = 1 Byzantine failure.

Method. Our basic experimental setup involves correct clients that operate in a closed loop—that is, they issue requests one at a time and do not issue request i until they receive a response to request i − 1. Unless otherwise noted, correct clients issue 100k 4KB requests. We increase system load by increasing the number of clients. Clients record the time at which each request is issued and the response received. We calculate the average latency of all requests issued by all clients. We calculate throughput, in requests per second, by dividing the total number of requests issued by all clients by the total duration of the experiment in seconds.

3.6.1 Common case performance

In this section we evaluate Aardvark in the absence of failures. We compare the throughput and latency of Aardvark to select previous systems during gracious execution and evaluate the impact of the key differences between Aardvark and previous systems.

Failure-free performance. We first measure the throughput and latency of Aardvark and the competing systems in the absence of failures. The results are shown in Figure 3.6. We see that Aardvark's peak throughput is competitive with that of contemporary state-of-the-art systems. Aardvark's throughput peaks at 38.7k operations per second, while Zyzzyva and PBFT observe maximum throughputs of 66k and 61.7k operations per second, respectively. The reliance on digital signatures to authenticate client requests increases the per-request processing in Aardvark, resulting in increased per-request latency and lower throughput. Aardvark, PBFT, and Zyzzyva provide higher throughput than HQ and Q/U because the former set of systems batch requests while the latter two systems process each request individually.

Figure 3.6: Average per-request latency vs. average throughput for Aardvark, HQ, PBFT, Q/U, and Zyzzyva.

Putting Aardvark together. Aardvark incorporates several key design decisions that enable it to perform well in the presence of Byzantine failure. We study the performance impact of these decisions by measuring the throughput of several variants of PBFT and Aardvark. Each variation corresponds to a piece-wise evolutionary step from PBFT to Aardvark. We measure the peak throughput of each variant by increasing the offered workload until throughput stabilizes. We report the peak throughput of each variant in Table 3.2. While requiring clients in PBFT to sign requests reduces throughput by 50%, we find that the cost of requiring Aardvark clients to use the hybrid MAC-signature scheme imposes a smaller 33% hit to system throughput. Explicitly separating the work queues for client and replica communication makes it easy for Aardvark to utilize the second processor in our test-bed machines, which reduces the throughput costs Aardvark pays to verify signed client requests. This parallelism is the primary source of the 30% improvement we observe between PBFT with signatures and Aardvark.

  System                                   Peak Throughput
  Aardvark                                 38.7k
  PBFT                                     61.7k
  PBFT w/ client signatures                31.8k
  Aardvark without signatures              57.4k
  Aardvark without regular view changes    39.8k

Table 3.2: Peak throughput of Aardvark and PBFT for different implementation choices.

Peak throughput for Aardvark with and without regular view changes is comparable.
The reason for this is rather straightforward: when both the new and old primary replicas are non-faulty, a view change requires approximately the same amount of work as a single instance of consensus. Aardvark views led by a non-faulty primary are sufficiently long that the throughput costs associated with performing a view change are negligible.

View Changes. We now explore the impact of performing regular view changes on the per-request latencies observed by clients. We measure the latencies observed by 210 clients, each issuing 100k requests. Clients are configured to retransmit requests if they do not receive a response within 150ms of issuing the request. Figure 3.7 shows the per-request latency observed by a single client during one run of the experiment. The periodic latency spikes correspond to view changes. When a client issues a request as the view change is initiated, the request is not processed until the request arrives at the new primary following a client timeout and retransmission. In most cases a single client retransmission is sufficient, but additional retransmissions may be required when multiple view changes occur in rapid succession. Figure 3.8 shows the CDF for latencies of all client requests in the same experiment. We observe that the vast majority of requests have latency under 15ms4, and only a small fraction of all requests incur the higher latencies induced by view changes.

4 Though it is not visible in the graph, we observe that 99.99% of requests have latency under 15ms.

Figure 3.7: The latency of an individual client's requests running Aardvark with 210 total clients. The sporadic jumps represent view changes in the protocol.

3.6.2 Evaluating faulty systems

In this section we evaluate Aardvark and existing systems in the context of failures. It is impossible to test every possible Byzantine behavior; consequently we use our knowledge of the systems to construct a set of workloads that we believe to be close to the worst case for Aardvark and other systems. While other faulty behaviors are possible and may stress the evaluated systems in different ways, we believe that our results are indicative of both the vulnerability of existing systems and the robustness of Aardvark.

Faulty clients. We focus our attention on two aspects of client behavior that have significant impact on system throughput: request dissemination and network flooding.

Request dissemination. Table 3.1 (in Section 3.1) depicts the impact of faulty client behavior related to request distribution on the PBFT, HQ, Zyzzyva, and Aardvark prototypes. We implement different client behaviors for the different systems to stress test the design decisions the systems have made.

Figure 3.8: CDF of request latencies for 210 clients issuing 100,000 requests with Aardvark servers.

In PBFT and Zyzzyva, the clients send requests that are authenticated with MAC authenticators. The faulty client includes an inconsistent authenticator on requests so that request verification will succeed at the primary but fail for all other replicas. When the primary includes the client request in a pre-prepare message, the replicas are unable to verify the request. We developed this workload because, on paper, the protocols specify what appears to be an expensive processing path to handle this contingency.
In this situation PBFT specifies a view change while Zyzzyva invokes a conflict resolution procedure that blocks progress and requires replicas to generate signatures. In theory, these procedures should have a noticeable, though finite, impact on performance. In particular, PBFT progress should stall until a timeout forces a new view ([16] pp. 42–43), at which point other clients can make some progress until the faulty client stalls progress again. In Zyzzyva, the servers should pay extra overheads for signatures and view changes.

In practice, the throughput of both prototype implementations drops to 0. In Zyzzyva the reconciliation protocol is not fully implemented; in PBFT the client behavior results in repeated view changes, and we have not observed our experiment to finish. While the full PBFT and Zyzzyva protocol specifications guarantee liveness under eventual synchrony, the protocol steps required to handle these cases are evidently sufficiently complex to be difficult to implement, easy to overlook, or both.

In HQ, our intended attack is to have clients send certificates during the write-2 phase of the protocol with an inconsistent MAC authenticator. The response specified by the protocol is a signed write-2-refused message, which the client subsequently uses to initiate a request processed by an internal PBFT protocol. This set of circumstances presents a point in the HQ design where a single client, either faulty or simply unlucky, can force the replicas to generate expensive signatures, resulting in a degradation in system throughput. We are unable to evaluate the precise impact of this client behavior because the replica processing necessary to handle inconsistent MAC authenticators from clients is not implemented in the HQ prototype.

In Q/U, during periods of contention when multiple clients issue concurrent requests that modify or depend on overlapping state, replicas are required to perform barrier and commit operations that are rate limited by a client-initiated exponential back-off. During the barrier and commit operations, a faulty client that sends inconsistent certificates to the replicas can theoretically complicate the process further. We implement a simpler scenario in which all clients are correct, yet they issue contending requests to the replicas. In this setting with only 20 clients, the throughput of the Q/U prototype also drops to zero. Q/U's focus on performance in the absence of both failures and contention makes it especially vulnerable in practice—clients that issue contending requests can decimate system throughput, whether the clients are faulty or not.

To avoid corner cases where different replicas make different judgments about the legitimacy of a request, Aardvark clients sign requests. In Aardvark, the closest client behaviors analogous to those discussed above for other systems are sending requests with a valid MAC and invalid signature or sending requests with invalid MACs. We implement both attacks and find the results to be comparable. In Table 3.1 we report the results for requests with invalid MACs. Aardvark does not suffer from a throughput degradation comparable to the ones observed in previous systems because it is able to process the faulty requests efficiently. Requests with an invalid MAC are discarded quickly and do not induce any replica-to-replica communication. Similarly, requests with an invalid signature induce a high one-time cost for the primary, but subsequent requests from that client are efficiently discarded. It is important to note that the client in this attack follows the retransmission schedule of a correct client.
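The asymmetry between the two Aardvark client misbehaviors can be seen in the order in which a replica applies its checks. The sketch below (hypothetical method and type names, not the actual Aardvark code) performs the cheap MAC check before the expensive signature check and remembers clients whose signatures fail, which is one way to obtain the behavior described above.

    // Hypothetical sketch of cheap-first request filtering; not the Aardvark implementation.
    boolean acceptClientRequest(ClientRequest req, Set<Long> blacklisted) {
        if (blacklisted.contains(req.clientId())) {
            return false;                    // prior invalid signature: drop with no crypto work
        }
        if (!verifyMac(req)) {               // cheap check: one MAC verification
            return false;                    // dropped locally, no replica-to-replica traffic
        }
        if (!verifySignature(req)) {         // expensive check: one public-key verification
            blacklisted.add(req.clientId()); // pay the signature cost at most once per faulty client
            return false;
        }
        return true;                         // request may enter the system
    }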
Our next discussion explores the impact of clients and servers that aggressively transmit messages.

Network flooding. In Table 3.3 we demonstrate the impact of a single faulty client that floods the replicas with messages. During these experiments correct clients issue requests sufficient to saturate each system while a single faulty client implements a brute-force denial-of-service attack by repeatedly sending 9KB UDP messages to the replicas5. For PBFT and Zyzzyva, 210 clients are sufficient to saturate the servers, while Q/U and HQ are saturated with 30 client processes6.

5 The faulty client is a modified PBFT client instrumented to repeatedly send messages of maximal size, 9KB in the release we evaluate.
6 These client saturation numbers are specific to our experimental machines and closed-loop client construction.

The PBFT and Zyzzyva prototypes suffer dramatic performance degradation as their incoming network resources are consumed by the flooding client; processing the incoming client requests disrupts the replica-to-replica communication necessary for the systems to make progress. In both cases, the pending client requests eventually overflow queues internal to the server processes, resulting in a seg-fault and subsequent crash. Q/U and HQ suffer smaller degradations in throughput from the flooding client. The UDP traffic is dropped by the network stack with minimal processing because it does not contain valid TCP packets. The slowdowns observed in Q/U and HQ correspond to the displaced network bandwidth. The reliance on TCP communication in Q/U and HQ changes rather than solves the challenge presented by a flooding client. For example, a single faulty client that repeatedly requests TCP connections crashes both the Q/U and HQ servers. In each of these systems, the vulnerability to network flooding is a byproduct of the prototype implementation and is not fundamental to the protocol design. Network isolation techniques such as those described in Section 3.4 could similarly be applied to these systems.

  System      Peak Throughput    Network Flooding
                                 UDP        TCP
  PBFT        61.7k              crash      -
  Q/U         23.8k              23.1k      crash
  HQ          7.6k               4.5k       0
  Zyzzyva     66k                crash      -
  Aardvark    38.7k              7.8k       -

Table 3.3: Observed peak throughput of BFT systems in the fault-free case and under heavy client retransmission load. UDP network flooding corresponds to a single faulty client sending 9KB messages. TCP network flooding corresponds to a single faulty client repeatedly opening TCP connections and is shown only for TCP-based systems.

In the case of Aardvark, the decision to use separate NICs and work queues for client and replica requests ensures that a faulty client is unable to prevent replicas from processing requests that have already entered the system. The throughput observed by Aardvark tracks the fraction of requests that replicas receive that were sent by non-faulty clients.

Faulty Primary. In systems that rely on a primary, the primary controls the sequence of requests that are processed. In Table 3.4 we show the impact on the PBFT, Zyzzyva, and Aardvark prototypes of a primary that delays sending pre-prepare messages by 1, 10, or 100 ms.
The throughput of both PBFT and Zyzzyva degrades dramatically because the slow primary is not slow enough to trigger their view change conditions. This throughput degradation is a consequence of the protocol design and specification of when view changes should occur. With an extremely slow primary, Zyzzyva eventually succumbs to a memory leak exacerbated by holding on to requests for an extended period of time. The throughput achieved by Aardvark indicates that adaptively performing view changes in response to observed throughput is a good technique for ensuring performance.

  System      Peak Throughput    1 ms      10 ms     100 ms
  PBFT        61.7k              5k        4.9k      1.1k
  Zyzzyva     66k                27.8k     5k        crash
  Aardvark    38.7k              38.5k     37.3k     37.9k

Table 3.4: Throughput during intervals in which the primary delays sending pre-prepare messages (or equivalent) by 1, 10, and 100 ms.

In addition to controlling the rate at which requests are inserted into the system, the primary is also responsible for controlling which requests are inserted into the system. We evaluate this impact by instrumenting a single replica to defer processing requests from a specified client and report the throughput observed for the shunned client and the average throughput for the remaining clients. Table 3.5 depicts the results of this experiment.

  System      Starved Throughput    Normal Throughput
  PBFT        1.25                  1.5k
  Zyzzyva     0                     1.7k
  Aardvark    358                   465

Table 3.5: Average throughput for a starved client that is shunned by a faulty primary versus the average per-client throughput for any other client.

In the case of PBFT and Aardvark, the primary sends a pre-prepare for the targeted client's request only after receiving the request 9 times. This heuristic prevents the PBFT primary from triggering a view change and demonstrates dramatic degradation in throughput for the targeted client in comparison to the other clients in the system. For Zyzzyva, the unfair primary ignores messages from the targeted client entirely. The resulting throughput is 0 because the implementation is incomplete, and replicas in the Zyzzyva prototype do not forward received requests to the primary as specified by the protocol. Aardvark's fairness detection and periodic view changes limit the impact of the unfair primary.

Non-Primary Replicas. We implement a faulty replica that does not process protocol messages and instead blasts network traffic at the other replicas. We report the results of running the systems with the blasting replica in Table 3.6. In the first experiment, a faulty replica blasts 9KB UDP messages at the other replicas7.

7 The PBFT prototype uses UDP for inter-server communication. The flooding replica is implemented with a PBFT client that sends well-formed, but nonsensical, request messages.

  System      Peak Throughput    Replica Flooding
                                 UDP       TCP
  PBFT        61.7k              251       -
  Q/U         23.8k              19.3k     crash
  HQ          7.6k               crash     crash
  Zyzzyva     66k                0         -
  Aardvark    38.7k              11.7k     -

Table 3.6: Observed peak throughput and observed throughput when one replica floods the network with messages. UDP flooding consists of a replica sending 9KB messages to other replicas rather than following the protocol. TCP flooding consists of a replica repeatedly attempting to open TCP connections on other replicas.

The PBFT and Zyzzyva prototypes again show very low performance as the incoming traffic from the spamming replica displaces much of the legitimate traffic in the system, denying the system both requests from the clients and also the replica messages required to make progress.
Aardvark's use of separate worker queues ensures that the replicas process the messages necessary to make progress. In the second experiment, the faulty Q/U and HQ replicas again open TCP connections, consuming all of the incoming connections on the other replicas and denying the clients access to the service. Once again, the shortcomings of the systems are a byproduct of implementation and not protocol design. We speculate that improved network isolation techniques would make the systems more robust.

3.7 Conclusion

We claim that high-assurance systems require BFT protocols that are more robust to failures than existing systems. Specifically, BFT protocols suitable for high-assurance systems must provide adequate throughput during uncivil intervals in which the network is well behaved but an unknown number of clients and up to f servers are faulty. We present Aardvark, the first BFT state machine protocol designed and implemented to provide good performance in the presence of Byzantine faults. Aardvark sacrifices peak throughput during gracious executions in order to gain significant improvement in performance during uncivil executions.

This chapter contains two important contributions. The first contribution is the design, presentation, and evaluation of the Aardvark prototype. The second, and most important, contribution is the observation that fault tolerant systems should be robust to failures—it is not enough to ensure safety and eventual liveness if, during a synchronous interval, a faulty node can reduce system throughput to unacceptably low levels. This simple observation is most notable for its absence in the discussion of previous systems [1, 18, 26, 45, 50, 49, 67, 86, 92, 104, 107]. While the discussion in this chapter has focused on the specific example of asynchronous BFT RSM protocols, we believe that robust performance is an important goal for all fault tolerant systems.

Chapter 4

UpRight RSM Architecture

State machine replication is a powerful technique for building reliable services from faulty components [52, 88]. The basic approach is simple: convert an application to a deterministic state machine, replicate the state machine, and ensure that each replica executes the same set of requests in the same order. Clients then gather votes from multiple replicas to determine the correct response to deliver to the user. The ultimate goal of an application built on top of an RSM protocol is to ensure that the set of responses received by users of the replicated service is indistinguishable from a set of responses that could have been generated by a single correct server given the same set of requests. Indeed, the systems described in the previous chapter (and many others [1, 12, 18, 24, 26, 49, 50, 67, 92, 104, 108]) are based on state machine replication. These libraries typically specify a linearized order of requests and require replicated applications to execute the requests in the specified linearized order. Initial work towards accommodating parallel execution has focused on requiring the application to execute requests so that the responses and resulting state are equivalent to executing the requests in the specified order [50].
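As a concrete illustration of the last step of this approach—gathering votes from multiple replicas—the following sketch collects replies until enough matching response digests arrive. The types, method names, and quorum size are illustrative placeholders rather than the interface of any particular library; a real client would also handle timeouts and retransmission.

    // Hypothetical sketch of client-side vote gathering; Reply is a placeholder type.
    byte[] awaitMatchingReplies(long requestId, int quorum,
                                BlockingQueue<Reply> replies) throws InterruptedException {
        Map<String, Integer> votes = new HashMap<>();
        while (true) {
            Reply r = replies.take();                       // reply from some replica
            if (r.requestId() != requestId) continue;       // stale reply for an older request
            String key = Base64.getEncoder().encodeToString(r.responseDigest());
            int count = votes.merge(key, 1, Integer::sum);  // tally replicas with matching digests
            if (count >= quorum) return r.response();       // enough replicas agree on the answer
        }
    }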
The rest of this thesis focuses on the design, implementation, and use of the UpRight library for state machine replication. The UpRight library is a new library for state machine replication that differs from previous fault tolerant RSM libraries in three important ways.

First, the UpRight library is the first RSM library designed to provide (a) UpRight fault tolerance (as opposed to the customary crash [12, 53, 67, 108] or Byzantine [18, 24, 26, 49, 50, 92, 104] fault tolerance) and (b) robust performance from the outset. While statements of fault tolerance equivalent to UpRight fault tolerance have been described before [1, 32, 55], these statements have not resulted in systems designed to leverage the flexibility and low replication costs of UpRight fault tolerance. Similarly, although the Aardvark prototype discussed in Chapter 3 instantiates RBFT, it is accurate to say that robustness was shoe-horned onto an existing library (PBFT [18]). Starting the design with robustness and UpRight fault tolerance as initial goals allows for a clean end-to-end design. Chapters 2 and 3 provide the foundation for understanding UpRight and robust fault tolerance.

Second, the architecture of the UpRight library is based on three distinct stages corresponding to the key steps of state machine replication: request authentication, agreement on an execution order, and deterministic execution of the specified order1. The latter two stages are standard and form the core of Schneider's definition of state machine replication [89]. The distinction between these two stages has been leveraged in previous system designs [60, 104, 107] in order to reduce the required number of replicas2. The request authentication stage is a new stage introduced for the UpRight library as a direct result of our experience with the Aardvark prototype and robust fault tolerance (Chapter 3). The three stages of the UpRight architecture have distinct replication requirements, and designing the library around the three stages facilitates efficient usage of computer resources in deployed systems.

1 The client side of the library can be considered a fourth stage.
2 It is important to note that most existing systems combine agreement and execution into a single stage [1, 18, 24, 26, 49, 50, 53, 92].

Third, the UpRight library refines the responsibilities and expectations of the library and replicated applications. The refined responsibilities are reflected in two key differences between the API exposed by the UpRight library and the API of previous libraries. First, the UpRight library requires a replicated application to deterministically execute batches of requests in a linearized order; in contrast, previous libraries [18, 24, 49, 50, 92, 104, 107] have required replicated applications to deterministically execute a linearized order of requests. This change in semantics exposes the efficiency-driven internal batching performed by libraries [18, 24, 49, 50, 92, 104, 107] to the application and explicitly emphasizes the possibility of executing non-conflicting requests within the batch in parallel, providing a solid basis for leveraging parallel hardware and resources. Second, the UpRight library requires a replicated application to deterministically produce checkpoints on demand; previous replication libraries have taken the checkpoint for the application. While generating deterministic checkpoints on demand initially seems like an extra burden for the application programmer, we believe that it is actually simpler than current techniques requiring the application to be rewritten to support a memory model defined by the replication library.

The rest of this chapter is organized as follows. Section 4.1 provides an overview of the UpRight architecture.
Section 4.2 describes the responsibilities of the UpRight library and replicated applications in more detail. Section 4.3 previews the subsequent chapters and explains the relationship between those chapters and the ideas discussed in this chapter.

4.1 UpRight architecture

There are a multitude of design decisions that go into building systems. Many of these decisions, e.g. using MACs rather than digital signatures to authenticate messages, appear straightforward; it is easy to hope that they can be introduced through local pinhole changes. The reality is that these "small" design decisions can have wide-reaching impact on the end-to-end system. For example, the decision to replace digital signatures with MACs improves performance (i.e. reduces latency and increases throughput) in the common case, but introduces an expensive corner case that can lead to significant performance degradation, as shown in Section 3.6.2.

We design the UpRight library to replicate the servers, using state machine replication, in client-server systems. The "client" portion of the system consists of the client-side application (aka user) and a library client. The "server" portion of the system consists of the original application server and the library components used to coordinate multiple replicas of the application server. Figure 4.1 provides a graphical depiction of a client-server system implemented with the UpRight library.

Figure 4.1: Basic flow of messages in the UpRight architecture.

The client portion of the UpRight library is responsible for interfacing between the application-level user code and the replicated server. The server portion of the UpRight library consists of three distinct stages: (1) the request authentication stage, (2) the request ordering stage, and (3) the request execution stage. Each stage fulfills a specific function: the authentication stage ensures that client requests are valid, the order stage places valid requests in batches and orders the batches, and the execution stage delivers batches to the application and relays application responses to clients.

Separating the server side of the UpRight library into three distinct stages allows us to provide clean solutions to problems that arise as we replicate the server and also to address shortcomings in previous replicated systems. Separating the authentication stage allows us to (a) authenticate client requests at low cost in both the average and worst cases (i.e. avoid the dangers of faulty clients (Section 3.6.2) without relying on public key cryptography) and (b) minimize the overall network bandwidth and costs associated with ordering requests. Separating order and execution allows us to (a) reduce the overheads of ordering and (b) reduce the total computation in the system by replicating each stage the minimum amount required for that stage rather than the maximum replication required for any stage3.

3 The benefits of separating order from execution have been noted by others [60, 104, 107]. We take care to address technical challenges related to checkpoint coordination overlooked in previous efforts to leverage that separation.
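The three server-side stages can be pictured as a pipeline with one narrow interface per stage. The interfaces below are only a sketch of that division of labor—the names are ours, not the UpRight API—and the message types are hypothetical placeholders; they mirror the message flow of Figure 4.1.

    // Illustrative stage interfaces; names and types are placeholders, not the UpRight API.
    interface AuthenticationStage {
        // Validate a client request and forward an authenticated copy to the order stage.
        void submit(ClientRequest request);
    }
    interface OrderStage {
        // Place authenticated requests into batches and assign each batch a position
        // in a single linearized order of batches.
        void order(AuthenticatedRequest request);
    }
    interface ExecutionStage {
        // Deliver ordered batches to the application, in order, and relay the
        // application's responses back to the clients that issued the requests.
        void execute(OrderedBatch batch);
    }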
At a high level, the protocol is simple: clients send requests to the authentication stage; the authentication stage authenticates the requests; the order stage assigns each authenticated request to a batch and assigns an execution order to the batches; the execution stage executes the batches of requests (by delivering them to the application) in the specified order and reports the results back to the clients. Of course, the reality is more complex than this high-level view implies, as various factors conspire to complicate the design—individual nodes can fail in unpredictable ways, the network may not be reliable, and computational/storage/network resources are finite. We will discuss the interaction between stages and the challenges associated with replicating each stage for reliability in subsequent chapters. This description also glosses over the interactions between the library and the application. We explore that set of interactions in the next section with a specific focus on the properties that the library and application are required to uphold.

4.2 Division of responsibilities

The previous section describes the internal architecture of the UpRight library. In this section, we focus on the contract between the UpRight library and replicated applications. Section 4.2.1 details the responsibilities of the UpRight library. Section 4.2.2 details the responsibilities of replicated applications.

4.2.1 Library properties

The UpRight library delivers a linearized sequence of batches of one or more requests to the application. In addition to guaranteeing that each application replica receives the same sequence of request batches, the UpRight library ensures that only authorized requests are included in the ordered batches and that the batches themselves are well-formed. The key difference between the properties of the UpRight library and previous libraries is that the UpRight library defines a linearized order for batch execution while previous libraries define a linearized order for request execution. In other words, the UpRight library defines a partial order for request execution rather than the total order defined by previous systems.

Before specifying the properties provided by the UpRight library, we first describe some basic notation. A batch of requests is identified by an identifier no. A batch contains a set of one or more requests each issued by an authorized client c, an associated pseudo-random number generator (PRNG) seed, and an associated time t. Requests from client c are differentiated by a request identifier nc, and each request is placed in at most one batch. Batch no is well-formed if it contains at most one request per authorized client c.

The UpRight library provides the following safety properties:

LS1 Only responses generated by the application are delivered to non-faulty users.
LS2 Only non-empty batches are delivered to the application.
LS3 Batch no is only delivered to the application if the previously delivered batch is no − 1.
LS4 Request nc issued via client c is placed in at most one batch.
LS5 If requests nc and n′c issued via client c are in batches no and n′o respectively, then nc < n′c iff no < n′o.
LS6 Only requests issued via an authorized client are placed in a batch4.
LS7 Given batches no and n′o and associated times t and t′, no < n′o → t < t′.

4 A client is "authorized" if it has appropriate credentials to interact with the servers. It is the responsibility of the sysadmin to secure access to authorized clients.

The UpRight library provides two distinct liveness properties.
First, any request issued by an authorized (and correct) client is delivered to the application. Second, any response generated by the application is delivered to the user that issued the request.

LL1 Any request issued via a correct client is eventually delivered to the application.
LL2 Any application-produced response to a request issued via correct client c is eventually provided to c and delivered to the user.

4.2.2 Application requirements

We explicitly depart from the "standard" requirements imposed on applications by replication libraries in the PBFT lineage [18, 86, 24, 26, 49, 92, 107, 104] in two important ways. First, as stated above, we require applications to execute batches, rather than individual requests, in a specified linearized order. Second, we explicitly charge the application with taking and loading checkpoints. Previous libraries have provided automatic checkpointing functionality and required the application to place all relevant information in a library-managed (and checkpointed) memory space.

We define some basic terminology that is useful in understanding the application properties and API. Let H be a linearized sequence of ordered batches. Let Hi be the sequence of the first i ordered batches. Let SH be the state of the application after executing every batch in H in order. Let CH be the checkpoint of state SH. Let H : b be a linearized sequence of ordered batches where b is the last batch in the sequence. Let Rb be a set of responses generated when processing batch b.

We require the application to implement three basic functions: execute a batch, take a checkpoint, and load a checkpoint. The execute batch function exec : S × b → ⟨S, R⟩ is a function from an application state and a batch of requests that transitions the application to a new state and produces a set with one response per request in the batch. The take checkpoint function takeCP : S → C generates a checkpoint that describes a valid application state, and the load checkpoint function loadCP : S × C → S sets the application state to the state described by the specified checkpoint5,6.

5 Note that exec, takeCP, and loadCP are not functions in the strictly mathematical sense. They correspond instead to functions in a programming API; as such there is no guarantee that exec will produce the same output each time it is provided with a specified input.
6 Note that exec and loadCP don't return an application state S, but rather transition the application to be in the specified state.

In a break from our normal pattern, we specify the application liveness properties first:

APPL1 exec(S, b) returns a set of responses Rb.
APPL2 takeCP(S) returns a checkpoint C.
APPL3 loadCP(S, C) returns.

In short, the liveness properties correspond to providing a terminating implementation of the execute, take checkpoint, and load checkpoint functions.
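A minimal rendering of this contract as an application-facing interface might look as follows. The method names, parameter types, and return types are illustrative assumptions, not the exact signatures exposed by the UpRight prototype.

    // Illustrative application interface for exec, takeCP, and loadCP;
    // names and signatures are placeholders, not the exact UpRight API.
    interface ReplicatedApplication {
        // exec: deterministically execute a batch, transitioning the application
        // state and returning one response per request in the batch (APPL1).
        List<byte[]> executeBatch(Batch batch);

        // takeCP: produce a checkpoint describing the current application state (APPL2).
        byte[] takeCheckpoint();

        // loadCP: restore the application to the state described by a checkpoint (APPL3).
        void loadCheckpoint(byte[] checkpoint);
    }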
The required application safety properties are more interesting as they aggressively restrict the behavior of the application:

APPS1 Only requests contained in batches received from the library are executed.
APPS2 If exec(SH, b) = ⟨SH:b, Rb⟩ and exec(SH, b) = ⟨S′H:b, R′b⟩ then SH:b = S′H:b and Rb = R′b.
APPS3 If takeCP(SH) = CH and takeCP(SH) = C′H then CH = C′H.
APPS4 ∀S : loadCP(S, takeCP(SH)) = SH.

APPS1 ensures that the application does not execute random requests. APPS2 ensures that the application executes batches deterministically. Deterministic batch execution is (a) useful when running multiple copies of the application and (b) vital because the UpRight library relies on execution replay to tolerate transient crashes. Note that deterministic batch execution requires that the application, when given the same sequence of batches from a specified starting state, (a) reaches the same final application state and (b) generates the same set of responses. APPS3 ensures that checkpoints generated by the application are deterministic based on when in the execution sequence take checkpoint is called. APPS4 ensures that after loading a checkpoint, the application is in the same state as when take checkpoint was called.

While these properties are intuitive, they are not intrinsic to every application. APPS2, for example, is violated by any application that puts a system time-stamp on every response it generates, as rolling back the application state and re-executing requests would result in a different set of time-stamps on the generated responses. APPS3-4 are violated by the checkpoints generated by replicas in the ZooKeeper distributed coordination service [108]. ZooKeeper replicas generate checkpoints at time t by asynchronously recording an application snapshot to stable storage and logging all requests processed between time t and when the snapshot is completed. The resulting checkpoint is loaded through a two-step process of first loading the snapshot and then replaying the log of requests7. This checkpoint procedure ensures a pair of weaker properties than we target. Specifically, ZooKeeper ensures that takeCP(SH) = CK and ∀S′ : loadCP(S′, takeCP(SH)) = SJ where H is a prefix of both K and J.

7 ZooKeeper requests are idempotent.

4.3 Looking forward

This chapter lays the framework for the design of the UpRight library and the expectations we place on the library and replicated applications. The chapters that follow expand on the key points made in this chapter and the interplay between the goals laid out in this chapter and the design and use of the UpRight library. Chapter 5 describes the stage-level architecture in detail. We focus the discussion on the interaction between correct stages and the properties that each stage must uphold. Chapter 6 describes the replication of each stage. We focus the discussion on how we provide replicated instantiations of the stages that fulfill the properties defined in Chapter 5 in a robust, UpRight fault tolerant manner. Chapter 7 describes our experiences with modifying the ZooKeeper distributed coordination service [108] and the Hadoop Distributed File System [43] to be compatible with the UpRight library. Our primary interest in Chapter 7 is evaluating the complexity of adapting an existing application to be compatible with the UpRight library. Future work remains on extending the techniques employed by Kotla et al. [50] to general applications or developing novel approaches to achieving deterministic parallel execution.

Chapter 5

UpRight Stages

This chapter treats each stage of the UpRight architecture as a unit, ignoring internal replication details. In practice each stage is implemented by a set of nodes, but to the extent possible we abstract away that detail in this chapter.
In particular, the replication we discuss in Chapter 6 will mask individual node crash, omission, and commission failures in the three stages and allow us to treat the stages as abstract correct entities in the current discussion. Our goal in this chapter is to fully describe the messages exchanged between stages1 and the properties that each stage provides.

1 When discussing the stage-level protocol we refer to a single message. The stage replication discussed in Chapter 6 requires most messages to be sent to every node in the next stage. Receiving messages from a replicated stage generally requires the recipient to gather a quorum of matching messages.

In the context of this chapter we differentiate between correct and idealized stages. An idealized stage follows its specification faithfully and is not limited by practical constraints such as limited memory or power outages. In contrast, a correct stage follows its specification faithfully, but has limited memory and is subject to temporary power outages.

Even though the replication within each stage masks failures of individual nodes and provides the abstraction of a correct stage, there are multiple challenges that we must address in the stage-level design: (1) clients can be faulty, (2) the network may not be reliable, (3) network resources are limited, (4) node (and by extension stage) storage resources are limited, and (5) an entire stage may transiently crash (i.e. temporary power failure).

First, clients can fail. While we can rely on replication to ensure that a stage is correct, clients cannot be replicated, and many systems would not trust clients to be correct even if client replication were possible. There are two distinct challenges associated with handling client failures: we must ensure that requests issued by correct clients are processed by the system, and we must ensure that faulty clients are unable to corrupt the system state or prevent correct clients' requests from being processed. Our high-level solution is to specify a contract for clients to follow—requests issued by a client that fulfills its part of the contract are eventually executed, while no guarantees are provided to clients that violate the contract.

Second, lossy and asynchronous networks can lose and arbitrarily delay messages. Messages can be lost between any pair of stages: a client request may not reach the authentication stage, an authenticated request may not reach the order stage, an ordered request may not reach the execution stage, and a result may not reach the client. Thus, our second challenge is to ensure that the system is safe and live despite the loss or delayed delivery of various messages. It is well known that it is impossible to achieve safe and live operation in the presence of an asynchronous network with failures [35]; we consequently target liveness only during sufficiently long synchronous intervals. Our high-level solution is to ensure that each stage caches relevant messages in transient memory for retransmission as needed. Note that retransmission can be caused by a push, i.e., the client retransmits a request if it does not receive a timely response, or a pull, i.e., the execution stage requests the retransmission of ordered batch i if it receives batch i + 1 first.

Third, network bandwidth is a limited resource. There is a limit to the number of bytes that can be exchanged between nodes. This limit becomes especially important when we consider one of the requirements of the replicated order stage: in order to order a request, every order node must have a copy of the request. Thus, our third challenge is to be economical in our use of network resources and avoid sending unnecessary information between stages or between the nodes in a given stage. Our high-level solution is to order cryptographic hashes of requests and store the request bodies at the authentication stage. The key observation is that request bodies can skip the order stage entirely and go straight to the execution stage.
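The sketch below illustrates this split: the authentication stage keeps the request body and hands the order stage only a fixed-size digest, which the execution stage later uses to fetch the body. It is a simplified illustration with hypothetical names, assuming SHA-256 as the hash; it is not the UpRight implementation.

    // Illustrative sketch of ordering by digest; names are placeholders.
    import java.security.MessageDigest;
    import java.util.Base64;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    class RequestStore {
        private final Map<String, byte[]> bodies = new ConcurrentHashMap<>();

        // Cache the full request body and return the fixed-size digest that is
        // actually shipped to (and ordered by) the order stage.
        byte[] cacheAndDigest(byte[] requestBody) throws Exception {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(requestBody);
            bodies.put(Base64.getEncoder().encodeToString(digest), requestBody);
            return digest;
        }

        // Serve a fetch from the execution stage: map an ordered digest back to its body.
        byte[] fetch(byte[] digest) {
            return bodies.get(Base64.getEncoder().encodeToString(digest));
        }
    }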
Fourth, nodes have finite memory. The nodes in each stage cannot maintain the arbitrary number of messages that they may be required to retransmit when messages are lost in transmission. Thus, our fourth challenge is to ensure that each stage garbage collects messages without impeding the system's ability to make safe and consistent progress (e.g., by discarding a message that may still be needed by another stage). Our high-level solution is to take order and execution checkpoints at pre-determined intervals and garbage collect messages that are made obsolete by recent checkpoints. Efficient checkpoint generation and garbage collection requires us to pay careful attention to the state stored, the timing of when checkpoints are taken, and when stored state can be garbage collected at each stage.

Fifth, even though replication can mask failure of some subsets of nodes (i.e. ensure liveness despite u failures and safety despite r commission failures), the nodes in a stage can transiently crash at arbitrary times (e.g. due to power failure). When a node crashes it loses the contents of transient memory, but storing data in persistent memory that survives transient failures is an expensive operation that we would like to avoid when possible. Avoiding frequent access to persistent memory is especially important at the execution stage, which is colocated with an application that may rely heavily on persistent memory access as part of processing requests. Thus, our final challenge is to ensure that sufficient state is stored in persistent memory to allow the system to efficiently resume operation following a transient crash of individual nodes or all nodes in a stage, while at the same time limiting the extent to which persistent memory becomes a performance bottleneck. Our high-level solution is to store in persistent memory (1) every request sent by the authentication and order stages and (2) order and execution checkpoints. This solution provides sufficient causal logging [5] to ensure that no information is lost due to catastrophic power failure while minimizing persistent memory contention at the execution stage.

The rest of this chapter is organized as follows. Section 5.1 presents a simplified end-to-end protocol across stages targeted at asynchronous networks and correct stages. In the context of this discussion we address the challenges associated with faulty clients and an unreliable asynchronous network; we do not consider the limitations of finite resources or transient crashes. Section 5.2 describes our basic approach for handling network bandwidth limitations. Section 5.3 describes our approach to handling the challenges of finite memory resources and transient crashes. Section 5.4 compiles the full set of safety and liveness properties for each stage into a collection of tables for easy reference. Section 5.5 describes performance optimizations supported by the UpRight prototype.
Section 5.6 establishes notational conventions that we rely on in Chapter 6 and Appendix A. Section 5.7 contains extensive pseudo-code and description for each stage.

Figure 5.1: Message flow between idealized stages in the UpRight architecture.

5.1 Basic stage interactions

We present a simplified version of the interactions between stages intended to provide a solid intuition for our goals at each stage and for how the stages interact with clients and each other. Our initial description focuses on the basic properties provided by each stage and on how the stages combine to provide the end-to-end properties defined in Section 4.2.1. In this initial description, we describe the interactions between idealized stages that have infinite memory, are not subject to transient failures, and always follow their specification faithfully; we allow for an unreliable asynchronous network and do not assume that the network or clients are correct. The communication between idealized stages is shown in Figure 5.1.

We address the challenges of a faulty network and faulty clients in the expected ways. We rely on clients to retransmit requests until they receive a response, we ensure that all executed requests are executed safely, but we only promise that requests issued by correct clients will be executed. We do not promise anything to faulty clients.

5.1.1 Client properties

Clients issue requests and accept responses to those requests. As clients cannot be assumed correct, we specify the expected behavior of correct clients—i.e. their side of the contract. A correct client c upholds a pair of safety properties:

CS1 Each request issued by client c is assigned a unique request identifier nc starting with 1 and increasing with each subsequent request.
CS2 Client c operates in a closed loop: it does not issue request nc > 1 unless it has received a response to request nc − 1.

Clients also uphold a single liveness property:

CL1 Client c resends request nc until it receives a response.

Note that we explicitly consider requests to be different if they are issued by different clients or issued by the same client but with different request identifiers. Additionally, we implicitly assume that requests issued by client c depend on each other: nc + 1 depends on nc, and so forth. Clients view the service as a black box. If client c upholds CS1-2 and CL1 then it eventually receives a response to every request that it issues. If c fails to uphold any of the properties then it may or may not receive a response.
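The client contract above amounts to a retransmitting, closed-loop request loop. The sketch below captures that behavior (CS1, CS2, CL1) using a hypothetical transport interface and timeout value; a real client would also verify the responses it accepts.

    // Illustrative closed-loop client honoring CS1, CS2, and CL1; types are placeholders.
    byte[] issue(byte[] operation, long nextRequestId, ClientTransport transport)
            throws InterruptedException {
        long nc = nextRequestId;              // CS1: unique, monotonically increasing identifier
        while (true) {                        // CL1: resend until a response arrives
            transport.send(nc, operation);
            byte[] response = transport.awaitResponse(nc, 150 /* ms, retransmit timeout */);
            if (response != null) {
                return response;              // CS2: only now may the caller issue request nc + 1
            }
        }
    }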
5.1.2 Authentication properties

The authentication stage validates client requests and sends authenticated copies of those requests to the order stage. The authentication stage validates a request op with request identifier nc from client c if (a) the request is verifiably issued by client c, (b) it has not received any request n′c > nc from client c, and (c) it has not received a request op′ ≠ op from client c with request identifier nc. We say that a request is authenticated when the authentication stage creates an authenticated request message containing the request and its associated authentication credentials. A correct verifier2 a that receives an authenticated request message directly from the authentication stage can verify the authenticity of the request based on the authentication credentials.

2 A verifier is any node tasked with verifying the authenticity of a received message. In this context, it refers to a node that receives a message from the authentication stage.

We say that an authenticated request message m is one-step transferable if, when a provides m received from the authentication stage to verifier b, both a and b make the same determination about the authenticity of the request. Note that if b subsequently passes m to a third verifier c, one-step transferability says nothing about the consistency between the conclusions drawn by verifiers b and c.

AS1 Only requests issued by authorized clients are authenticated and every authenticated request is one-step transferable.

The authentication stage also provides a simple liveness property:

AL1 If the authentication stage receives a request nc issued by correct client c, then an authenticated request message containing request n′c ≥ nc is sent to the order stage.

AL1 implies a potentially unexpected handling of retransmitted client requests: if the authentication stage has authenticated request nc from client c and subsequently receives request n′c < nc from c, then it resends authenticated request nc to the order stage. This behavior converts "old" requests to the most recent request processed for a specific client and does not impact correct clients that issue requests in a closed loop. The one-step transferable component of AS1 simplifies the design of the replicated order stage by circumventing challenges associated with handling the Big MAC attack discussed in Chapter 3.

5.1.3 Order properties

The order stage receives valid requests from the authentication stage, places them into batches, and assigns an execution order to the batches. The order stage places each batch into a next-batch message which is sent to the execution stage. The order stage provides the following safety properties:

OS1 Only client requests authenticated by the authentication stage are placed into batches, and request nc issued by client c is placed in at most one batch.
OS2 Batches contain one or more requests and are assigned monotonically increasing batch identifiers no starting with 1 and increasing by 1 with each subsequent batch. For batches no and n′o with associated times t and t′, no > n′o → t > t′.
OS3 If request nc > 1 issued by client c is in batch no, then request nc − 1 issued by client c is in batch n′o < no.

Note that the order stage enforces CS2; the order stage orders request nc for client c only if request nc − 1 has already been ordered. Hence, requests from a faulty client that does not uphold CS2 are not processed.

The liveness properties ensured by the order stage are straightforward:

OL1 If the order stage receives unordered authenticated request nc issued by correct client c, then the order stage places the request in batch no and eventually sends a next-batch message containing no to the execution stage.
OL2 If the order stage receives authenticated request nc from client c that is already in batch no, then it instructs the execution stage to retransmit a response to request n′c from client c in batch n′o, where n′c ≥ nc and n′o ≥ no, by sending a retransmission message.
OL3 If the execution stage requests all batches after ne and the order stage has ordered batches through no > ne, then the order stage resends all ordered batches from ne + 1 through no inclusive.
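One simple way to realize OS1, OS3, and OL2 is for the order stage to track, per client, the highest request identifier it has placed in a batch. The sketch below is only an illustration of that bookkeeping with hypothetical names, not the UpRight implementation.

    // Illustrative per-client bookkeeping at the order stage; names are placeholders.
    class PerClientOrderState {
        long lastOrderedRequest = 0;   // highest nc placed in a batch for this client
        long containingBatch = 0;      // the batch no that contains that request
    }

    enum Action { ORDER, RETRANSMIT, IGNORE }

    // Decide how to handle an authenticated request nc from one client.
    Action handle(PerClientOrderState s, long nc) {
        if (nc == s.lastOrderedRequest + 1) return Action.ORDER;       // OS3: order nc only after nc - 1
        if (nc <= s.lastOrderedRequest)     return Action.RETRANSMIT;  // OL2: already in some batch (OS1)
        return Action.IGNORE;  // gap in identifiers: a client violating CS2 gets no guarantees
    }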
During normal operation, the order stage receives each authenticated client request once, places the request in a batch, and sends the batch to the execution stage for execution. An unreliable asynchronous network can drop messages arbitrarily; when that occurs, some form of retransmission is necessary. Dropped messages and retransmissions impact the order stage in two ways.

First, if client c does not receive a response to request nc, then it retransmits the request until it receives a response. This retransmission, in conjunction with the authentication stage, can result in the order stage receiving authenticated request nc multiple times. When the order stage receives from c a request nc that has already been ordered, it instructs the execution stage to retransmit the response to that request—if the order stage has ordered a subsequent request n′c > nc from c, then it requests retransmission of the response to request n′c instead. Note that correct clients operate in a closed loop and are consequently not impacted by the retransmission of a later request.

Second, if the execution stage misses an ordered batch (i.e. receives batch no but not batch no − 1) then it requests the missing batches from the order stage. The order stage responds to the execution stage with the collection of missing batches.

5.1.4 Execution properties

The execution stage delivers batches to the application in the order specified by the order stage. Each batch is delivered to the application exactly once, and the responses provided by the application after executing the contained requests are cached by the execution stage for potential future retransmission. We say that the execution stage is in state no if it has executed every batch with batch identifier at most no. The execution stage transitions from state no to state no + 1 when it executes batch no + 1. We assume that the application executes a batch of requests instantaneously. This assumption simplifies the discussion by masking additional complexity that can be accounted for through engineering the execution stage and application. It does not change the conceptual properties or the relationship between the execution stage and the rest of the system.

The execution stage provides three safety properties:

ES1 Batch no is delivered to the application only if the last batch delivered to the application is no − 1.
ES2 Only ordered batches are delivered to the application.
ES3 Only responses generated by the application are cached or sent to clients.

We note that ES1 and ES2 are closely related but distinct properties. ES1 ensures that ordered batches are delivered to the application in the specified order. ES2 ensures that only ordered batches are delivered to the application; ES2 specifically prevents arbitrary requests that are not included in an ordered batch from being delivered to the application.

Intuitively, the execution stage ensures that every ordered batch is executed. To achieve that goal, we rely on the following liveness properties:

EL1 If the execution stage receives ordered batch no and the last batch it delivered to the application is no − 1, then the execution stage delivers batch no to the application.
EL2 If the execution stage receives a response from the application, then it stores the response for retransmission and sends the response to the client that issued the corresponding request.
EL3 If the execution stage receives a retransmission instruction for request nc from c in batch no and the last batch executed by the execution stage is ne > no, then the execution stage resends the response to the most recent request n′c ≥ nc executed for client c.
EL4 If the execution stage receives a retransmission instruction for request nc from client c in batch no and the last batch executed by the execution stage is ne < no, then the execution stage informs the order stage that it has missed the batches since ne.

Note that the execution stage notifies the order stage that it has missed a collection of ordered batches only after receipt of a retransmission request and not when it receives ordered batch messages out of order. This counter-intuitive decision is driven by how we handle limited resources and will be discussed in more detail in Section 5.3.
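The execution-stage side of retransmission (EL2–EL4) reduces to a per-client reply cache plus a record of the last executed batch. The following sketch is an illustration with hypothetical names (and assumes the java.util collection classes), not the UpRight code.

    // Illustrative reply cache at the execution stage; names are placeholders.
    class CachedReply {
        final long nc; final byte[] response;
        CachedReply(long nc, byte[] response) { this.nc = nc; this.response = response; }
    }

    class ReplyCache {
        long lastExecutedBatch = 0;                                  // highest batch delivered to the app
        final Map<Long, CachedReply> lastReplyByClient = new HashMap<>();

        // EL2: remember the latest application response for each client.
        void record(long clientId, long nc, byte[] response) {
            lastReplyByClient.put(clientId, new CachedReply(nc, response));
        }

        // Handle a retransmission instruction for request nc in batch no.
        void onRetransmit(long clientId, long nc, long no, OrderStageLink order, ClientLink client) {
            if (lastExecutedBatch >= no) {
                CachedReply r = lastReplyByClient.get(clientId);
                if (r != null && r.nc >= nc) client.send(clientId, r.response);   // EL3
            } else {
                order.requestBatchesAfter(lastExecutedBatch);                      // EL4
            }
        }
    }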
5.1.5 Putting the stages together

Given the stage properties identified above, demonstrating that the combination of stages maintains the desired end-to-end properties is straightforward. Recall the properties that the library is expected to uphold as defined in Section 4.2.1:

LS1 Only responses generated by the application are delivered to non-faulty users.
LS2 Only non-empty batches are delivered to the application.
LS3 Batch no is only delivered to the application if the previously delivered batch is no − 1.
LS4 Request nc issued via client c is placed in at most one batch.
LS5 If requests nc and n′c issued via client c are in batches no and n′o respectively, then nc < n′c iff no < n′o.
LS6 Only requests issued via an authorized client are placed in a batch3.
LS7 Given batches no and n′o and associated times t and t′, no < n′o → t < t′.
LL1 Any request issued via a correct client is eventually delivered to the application.
LL2 Any application-produced response to a request issued via correct client c is eventually provided to c and delivered to the user.

3 A client is "authorized" if it has appropriate credentials to interact with the servers. It is the responsibility of the sysadmin to secure access to authorized clients.

Note that the two liveness properties apply only to requests issued via correct clients.

Theorem 4. If the authentication, order, and execution stages uphold their respective safety properties, then LS1-7 hold.

Proof. LS1: Follows from ES3. LS2: Follows from ES2. LS3: Follows from ES1. LS4: Follows from OS1. LS5: Follows from OS3. LS6: Follows from OS1 and AS1. LS7: Follows from OS2.

Lemma 1. Given eventual synchrony and correct authentication, order, and execution stages, LL1 holds.

Proof. It follows from CL1 that correct client c issues request nc until it receives a response. There are two cases to consider: (1) the client receives a response, (2) the client does not receive a response.

Case 1: The client c has received a response. It follows from ES3 that only responses generated by the application are sent to c. It follows from APPS1 that only requests in batches delivered to the application by the execution stage are executed. Hence, the response received by c implies that request nc was delivered to the application.

Case 2: The client has not received a response. It follows from CL1 that the client will issue request nc arbitrarily often. It follows from eventual synchrony that the request is received by the authentication stage arbitrarily often. It follows from AL1 that some request n′c ≥ nc from c is authenticated arbitrarily often and from CS2 that that request is nc. It follows from eventual synchrony that the request is received by the order stage arbitrarily often. Request nc will be placed in a batch and a retransmission instruction for request nc is sent to the execution stage arbitrarily often. The first time request nc is received by the order stage, it follows from CS1 that OS1 is satisfied.
It follows from OL1 that nc is ordered in batch no and batch no is sent to the execution stage. Every subsequent time that request nc is received, it follows from OL2 that a retransmission instruction for request n′c ≥ nc is sent to the execution stage. It follows from CS2 that n′c = nc. It follows from eventual synchrony that the retransmission instruction is received by the execution stage arbitrarily often.

Upon receipt of a retransmission instruction by the execution stage there are two possibilities to consider: (1) the batch containing the request has been delivered to the application or (2) the batch has not been delivered. In the first scenario we are done. Consider the second scenario. It follows from EL4 that the execution stage sends the last executed notification to the order stage arbitrarily often. It then follows from OL3 that the order stage resends the missing batches arbitrarily often, until they are no longer missing. It follows from eventual synchrony and induction that all batch messages will eventually be received by the execution stage and that the batches are subsequently delivered to the application.

Lemma 2. Given eventual synchrony and correct authentication, order, and execution stages, LL2 holds.

Proof. It follows from EL1 that the response is cached and sent to the client. If it is received, then we are done. Otherwise, it follows from CL1 that a correct client reissues its request until it receives a response. It follows from AL1 that the request is authenticated and sent to the order stage. It follows from APPS1 that only requests contained in batches are executed and from ES2 that only ordered batches are delivered to the application. It then follows that nc has already been ordered, so by OL2 a retransmission message is sent. It then follows from EL4 that the response is retransmitted to the client. The above happens arbitrarily often, and by eventual synchrony the response is eventually received by the client.

5.2 Network efficiency

When network resources are limited it is important to limit the number of bytes sent across the network. Our basic description of the stage-level protocol has the authentication stage sending authenticated requests to the order stage and the order stage sending those requests to the execution stage. When we peek beneath the covers at the implementation details of the order stage, we see that every order node must receive and maintain a copy of every request that is ordered.

We observe that the order stage can add requests to batches and order batches without processing the full requests—a cryptographic hash is sufficient to uniquely identify any ordered request. We consequently modify the authentication stage to cache authenticated requests and send only authenticated request hashes to the order stage for ordering. The execution stage then fetches the bodies for all ordered requests prior to executing a batch. The changes to basic operation are depicted in Figure 5.2. We say a request is fetchable if the request body is stored at the authentication stage.

Accommodating this change to the protocol requires us to introduce an additional safety and liveness property at the authentication stage.

AS2 Every authenticated request is fetchable.

AL2 If the authentication stage receives a fetch message from the execution stage for an authenticated request nc issued by client c, then the authentication stage responds with the request body.
This change also requires us to modify OS1 to ensure that only requests that are both authenticated and fetchable are ordered. (This modification is subtle and its necessity is not immediately apparent; we revisit this point in Section 6.4.)

OS1 Only fetchable client requests authenticated by the authentication stage are placed into batches, and request nc issued by client c is placed in at most one batch.

Figure 5.2: Messages exchanged between stages. (1) Clients send requests to the authentication stage. (2) The authentication stage sends validated request hashes to the order stage. (3) The order stage sends ordered batches to the execution stage. (4a, 4b) The execution stage fetches request bodies from the authentication stage. (4c) The execution stage sends responses to the clients. Note that the messages travel through the system in a clockwise fashion.

Additionally, we expand the liveness property EL1 into two components that target fetching request bodies and executing batches separately.

EL1a If the execution stage receives ordered batch no and the last batch it has delivered to the application is n′o < no, then it fetches the request bodies for requests in batch no from the authentication stage.

EL1b If the execution stage has all of the request bodies for batch no and the last batch it delivered to the application is no − 1, then the execution stage delivers batch no to the application.

Note that EL1 can be acquired by combining the non-italicized portions of EL1a and EL1b.

5.3 Garbage collection and transient crashes

Of the five challenges identified at the beginning of this chapter, two remain unaddressed: (1) stages have finite memory and (2) stages can exhibit transient crashes. The mechanisms used to address these challenges are closely intertwined.

Individual machines have finite memory. The retransmission mechanisms used to mask asynchronous network behaviors can require the authentication and order stages to cache for retransmission an arbitrary number of requests and ordered batches respectively. We address this problem through the combination of (a) checkpoint generation and garbage collection and (b) stage interlocking. Checkpoint generation and garbage collection allow us to periodically eliminate a prefix of the state at each stage, and stage interlocking allows us to prevent one stage from getting too far ahead of, or behind, the others. Figure 5.3 depicts the state stored by each stage and how the pieces interact; the rest of this section is devoted to a detailed description of the basic approach highlighted here.

The order stage takes an order stage checkpoint every CP interval batches. We ensure that the order stage always maintains at least one checkpoint of its state and a log of between CP interval and 2 × CP interval batches ordered since that checkpoint was generated. We ensure that the execution stage maintains an execution stage checkpoint that corresponds to each checkpoint stored at the order stage and that the authentication stage has the bodies of all requests ordered in subsequent batches. We coordinate garbage collection at the three stages to ensure that the authentication stage only garbage collects request bodies when they are no longer needed by the execution stage, and the execution stage garbage collects checkpoints only when they are no longer referenced by the order stage.
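The coordination just described reduces to a small amount of window arithmetic over batch identifiers. The sketch below is only an illustration of that arithmetic under assumed names (CheckpointWindow, CP_INTERVAL, and the predicates are ours, not part of the UpRight code base); it anticipates the per-stage rules detailed in the remainder of this section.

    // Minimal sketch of the checkpoint-window interlock (illustrative names only).
    public final class CheckpointWindow {
        static final long CP_INTERVAL = 100;      // batches between checkpoints (example value)

        final long baseCheckpoint;                // identifier of the stable base checkpoint
        final boolean secondaryStable;            // is the checkpoint at base + CP_INTERVAL stable?

        CheckpointWindow(long baseCheckpoint, boolean secondaryStable) {
            this.baseCheckpoint = baseCheckpoint;
            this.secondaryStable = secondaryStable;
        }

        // The order stage may order a batch that is 2 * CP_INTERVAL past its base
        // checkpoint only once the intervening secondary checkpoint is stable.
        boolean mayOrder(long batchId) {
            return batchId < baseCheckpoint + 2 * CP_INTERVAL || secondaryStable;
        }

        // The authentication stage must keep the body of any request ordered in a
        // batch at or after the base checkpoint (plus any request not yet ordered).
        boolean mustRetainBody(long orderedInBatch) {
            return orderedInBatch >= baseCheckpoint;
        }

        // The execution stage may discard checkpoint execCp once ordering has advanced
        // two full intervals past it, i.e. the order stage no longer references it.
        boolean mayDiscardExecCheckpoint(long execCp, long latestOrderedBatch) {
            return latestOrderedBatch >= execCp + 2 * CP_INTERVAL;
        }
    }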
At a high level, these steps ensure that following a transient crash of one or more stages we can resume operation as if nothing went wrong. We differentiate between transient and persistent memory. The content of transient memory may be lost during a transient crash; the content of persistent memory persists through a transient crash. We ensure that stages survive transient crashes by recording the checkpoints and associated state in persistent memory. The order stage stores order checkpoints and the log of ordered batches in persistent memory, the execution stage stores execution checkpoints in persistent memory, and the authentication stage stores authenticated request bodies in persistent memory.

In the rest of this section we present a stage-by-stage description of the state required by each stage, the construction of checkpoints, and the timing of accesses to persistent memory. The details are tedious, but the specific design choices directly impact the set of properties that each stage must maintain and consequently have significant impact on the stage replication discussed in Chapter 6. Readers not interested in the discussion may wish to read only the "Additional properties" sections in the following text.

5.3.1 Order stage

State. The basic operation described so far requires the order stage to maintain (1) the log of ordered batches, (2) a table containing information (request and batch identifiers) on the last request ordered for each client, and (3) the next batch identifier to be consumed. The log of ordered batches is required to support batch retransmission required by a lossy network; the last ordered table is required to support appropriate handling of retransmitted client requests; the next batch identifier is used to ensure that there are no gaps or repeats in the sequence of batch identifiers.

The order stage maintains two additional pieces of information: a concise description of the history of ordered batches and the current time. The batch history is a required component of the design of our replicated order stage, discussed in Section 6.2, and it is included in the current discussion for completeness only.

Figure 5.3: Interactions between persistent state at each stage. The state maintained by the other stages depends on the state maintained at the order stage. The order stage maintains one or two checkpoints and between CP interval and 2 × CP interval − 1 ordered batches. The authentication stage maintains every request referenced by an ordered batch stored at the order stage and at most one pending request per client. The execution stage maintains two checkpoints that correspond to order stage checkpoints. Additional details on the contents of the order and execution checkpoints can be found in Figure 5.4 and Figure 5.5 respectively.

The order stage maintains the official system time for the UpRight library and any application replicated with the library. The time is included as part of each ordered batch, and the order stage guarantees that time is monotonically increasing with each batch. Including this time field in each ordered batch is an important part of (a) tolerating transient crashes at the execution stage and (b) facilitating application replication.
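The state just enumerated maps onto a handful of simple fields. The Java sketch below is purely illustrative (class and field names are ours), with the log of ordered batches kept separately as described in this subsection; it also shows one way to keep the per-batch time monotonic.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative rendering of the in-memory order-stage state (names are ours).
    public final class OrderStageState {
        // (2) last request ordered for each client: request id and the batch it fell in
        static final class LastOrdered { long requestId; long batchId; }
        final Map<String, LastOrdered> lastOrdered = new HashMap<>();

        long nextBatchId = 1;          // (3) next batch identifier to be consumed
        byte[] history = new byte[32]; // concise hash chaining all ordered batches
        long time = 0;                 // official system time, monotonic across batches
        // (1) the log of ordered batches is kept separately (see the text above)

        // Time stamped into a new batch must never move backwards.
        long nextBatchTime(long wallClock) {
            time = Math.max(time + 1, wallClock);
            return time;
        }
    }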
We discuss the time field in more detail when discussing the execution stage in Section 5.3.2.

We define an order stage checkpoint no to be a snapshot of the order stage state taken when all batches n′o < no have been ordered and no batch n′′o ≥ no has been ordered. An order stage checkpoint is depicted in Figure 5.4: it contains the next batch identifier to be consumed, the last ordered table, the history and time fields, and an execution checkpoint token taken at the same relative point in logical time (i.e. after processing batch no − 1 and before processing batch no). The execution checkpoint token is a concise representation (i.e., a hash) of an execution stage checkpoint. When an order stage checkpoint is first generated, the execution checkpoint token is null; that field of the order checkpoint is filled in only after the token is relayed from the execution stage. We say that an order stage checkpoint is complete if it contains the execution checkpoint token and incomplete otherwise.

Figure 5.4: Order stage checkpoint.

Garbage collection and transient crash recovery. The order stage always maintains a base checkpoint taken at nCP, where nCP mod CP interval = 0, a secondary checkpoint taken at nCP + CP interval, and a log of between CP interval and 2 × CP interval batches ordered since the base checkpoint. The base checkpoint and log of ordered batches are stable, i.e. stored in persistent memory. The secondary checkpoint may or may not be complete or stable. When the secondary checkpoint becomes complete, it is stored in persistent memory and made stable; only complete checkpoints are stable.

To bound state, the order stage does not order batch nCP + 2 × CP interval unless the secondary checkpoint at nCP + CP interval is both stable and complete. This restriction on ordering batch nCP + 2 × CP interval ensures that the order stage is responsible for caching at most 2 × CP interval batches, each containing at most one request per client. When batch nCP + 2 × CP interval − 1 is ordered, three steps are taken:

1. A new secondary checkpoint at nCP + 2 × CP interval is generated.

2. The old secondary checkpoint at nCP + CP interval becomes the new base checkpoint.

3. The old base checkpoint at nCP and all batches with identifier no < nCP + CP interval are garbage collected.

In our implementation, we facilitate garbage collection of persistent memory by storing each stable checkpoint in its own file (checkpoint nCP is stored in the file "order CP.i" where i = ⌊nCP /CP interval⌋ mod 2) and each block of CP interval ordered batches in a single file (batches no through no + CP interval − 1 are appended to the file "order log.i" where i = ⌊nCP /CP interval⌋ mod 2). Note that each ordered batch is recorded to the appropriate log file before it is sent to the execution stage. When checkpoint nCP is garbage collected, the files "order CP.i" and "order log.i" are cleared.

The order stage recovers from a transient crash by reading the base checkpoint, secondary checkpoint, and log of ordered batches from persistent memory. If there is no secondary checkpoint stored in persistent memory, then the secondary checkpoint is derived from the base checkpoint and the first CP interval ordered batches in the ordered batch log.
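As a concrete illustration of the two-slot rotation just described, the helper below computes the file slot and names for a given checkpoint identifier. The file names follow the text (written here with underscores); the class and method names are ours and the actual I/O is omitted.

    // Illustrative sketch of the two-slot checkpoint/log file rotation.
    public final class OrderLogRotation {
        static final long CP_INTERVAL = 100;   // example value

        static int slot(long checkpointId) {
            return (int) ((checkpointId / CP_INTERVAL) % 2);
        }

        static String checkpointFile(long checkpointId) {
            return "order_CP." + slot(checkpointId);
        }

        static String logFile(long checkpointId) {
            return "order_log." + slot(checkpointId);
        }
    }
    // Example: with CP_INTERVAL = 100, checkpoint 200 and the batches that follow it
    // land in slot 0; when checkpoint 400 is taken, checkpoint 200 has already been
    // garbage collected and slot 0 is cleared and reused.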
Additional properties. The techniques described above entail an additional safety property maintained by the order stage:

OS4 The order stage always maintains in persistent memory a stable checkpoint at no, where no mod CP interval = 0, and i subsequent ordered batches, where CP interval ≤ i ≤ 2 × CP interval.

The authentication and execution stages rely on this property to determine when it is safe for them to garbage collect state.

Garbage collecting the log of ordered batches prevents the order stage from resending arbitrarily old batches to the execution stage. We consequently modify OL3 to require the order stage to resend only recent batches and add a new liveness property that requires the order stage to send the execution checkpoint descriptor if the execution stage is further behind. (These considerations will become more apparent when we discuss execution stage garbage collection and transient crash recovery.)

OL3 If the execution stage requests all batches after ne and the order stage has ordered batches through no ≥ ne and ne + 1 ≥ nCP, then the order stage resends all ordered batches from ne through no.

OL4 If the execution stage requests all batches after ne and the order stage has ordered batches through no > ne and ne + 1 < nCP, then the order stage instructs the execution stage to load execution checkpoint nCP.

5.3.2 Execution stage

State. The execution stage maintains a replyCache consisting of the last response sent to each client, the identifier ne of the next batch to be delivered to the application, and a potentially empty set of request bodies for batches that have not yet been delivered to the application. The application is a component of the execution stage; the application state is consequently also part of the state of the execution stage.

We define an execution stage checkpoint ne to be a snapshot of the execution stage taken when all requests in batches n′e < ne have been executed by the application and no request in any batch n′′e ≥ ne has been executed. An execution stage checkpoint contains the replyCache, the batch identifier of the next unexecuted batch, and a snapshot of the application state as shown in Figure 5.5.

Figure 5.5: Execution stage checkpoint.

Garbage collection and transient crash recovery. The execution stage generates a new execution stage checkpoint before delivering batch no, where no mod CP interval = 0, to the application. After generating the checkpoint, the execution stage stores the checkpoint to persistent memory and sends a token (i.e. a hash) that uniquely describes the checkpoint to the order stage. Execution checkpoint no is written to file "exec CP.no."

Following a transient crash, the execution stage does nothing until it receives a message from the order stage. Because the execution stage starts off in a default state with ne = 0, it is unlikely to be able to execute the first batch that it receives and will eventually receive a retransmission request. Following receipt of the retransmission request, it notifies the order stage that it has last executed batch 0, at which point the order stage instructs the execution stage to load a specific checkpoint by passing the checkpoint token defined by the execution stage. The execution stage reads the checkpoint from persistent memory, confirms that the bytes it reads are consistent with the checkpoint token, and then resumes operation using the freshly loaded checkpoint.
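The token sent to the order stage and the consistency check performed after a crash amount to hashing the serialized checkpoint. A minimal sketch follows; the hash function choice, file layout, and names are assumptions of ours, not the UpRight implementation.

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.security.MessageDigest;
    import java.util.Arrays;

    // Sketch: the checkpoint token is a hash of the serialized execution checkpoint,
    // and a checkpoint loaded after a transient crash is accepted only if its bytes
    // re-hash to the token supplied by the order stage.
    public final class CheckpointToken {

        static byte[] token(byte[] serializedCheckpoint) throws Exception {
            return MessageDigest.getInstance("SHA-256").digest(serializedCheckpoint);
        }

        static byte[] loadAndVerify(Path checkpointFile, byte[] expectedToken) throws Exception {
            byte[] bytes = Files.readAllBytes(checkpointFile);
            if (!Arrays.equals(token(bytes), expectedToken)) {
                throw new IllegalStateException("checkpoint bytes do not match token");
            }
            return bytes;  // safe to deserialize and resume from this checkpoint
        }
    }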
The execution stage garbage collects execution checkpoint ne when it receives an ordered batch with identifier no ≥ ne + 2 × CP interval. This garbage collection is timed to ensure that the execution stage garbage collects a checkpoint only after the order stage has garbage collected any references to that execution checkpoint—ensuring that the order stage will not expect the execution stage to load the checkpoint in the future.

Network efficiency. Note that we do not send the execution checkpoint, but rather a token describing the checkpoint, to the order stage. Sending the token in place of the full checkpoint reduces (a) network traffic and (b) state maintained at the order stage. In Chapter 6 we discuss the implications of other designs, notably storing the full checkpoint or nothing at all in the order stage checkpoint.

Additional properties. We add two new safety properties and a single new liveness property to the execution stage. The safety properties are straightforward. First, the execution stage is required to maintain in persistent memory any checkpoint that it may be instructed to load by the order stage. Second, we require the execution stage to replay previously executed batches; the execution of a batch following a checkpoint load must correspond to the execution of that batch preceding the checkpoint load. The additional liveness property is similarly straightforward: we require the execution stage to load a specified checkpoint on demand.

ES4 The execution stage maintains in persistent memory the execution checkpoint referenced by the order-stage base checkpoint.

ES5 The execution stage provides deterministic and replayable execution of ordered batches.

EL5 If the execution stage receives an instruction to load checkpoint ne from the order stage, then it loads execution checkpoint ne.

Application implications. Note that the application property APPS2 is important because the UpRight library relies on log-based rollback recovery [34] to recover from transient crashes. Without the deterministic execution provided by APPS2, the execution stage could produce different responses when re-executing the set of ordered batches after loading the old checkpoint.

We believe that deterministic execution may be unnecessary if the application can provide fine-grained checkpoints and support checkpoint-based rollback recovery [34]. Fine-grained application checkpoints would allow the execution stage to generate a checkpoint after executing each batch—the system could then agree on the results (responses generated and state reached) of processing each batch before outputting the responses to the clients. Note that supporting this form of operation would require the library to agree on the result of executing each batch rather than the order of batches. We leave the exploration of application techniques for efficient fine-grained checkpoints and architectural changes to support speculative execution to future work.

5.3.3 Authentication stage

State. Compared to the order and execution stages, the state maintained by the authentication stage is straightforward, consisting only of the requests that it has authenticated. There are three different types of requests that the authentication stage must store at all times: (1) any request that has been ordered since the current order-stage base checkpoint, (2) any request that has been authenticated and not yet ordered, and (3) the last request authenticated for each client.
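A retention rule that keeps exactly these three categories can be written as a single predicate; the sketch below is illustrative (the names are ours) and would be evaluated per request body when deciding whether it may be discarded.

    // Sketch of the retention rule: keep a body if it was ordered at or after the
    // current base checkpoint, is authenticated but not yet ordered, or is the last
    // request authenticated for its client.
    public final class RetentionRule {
        static boolean mustRetain(Long orderedInBatch,            // null if not yet ordered
                                  long baseCheckpointBatch,
                                  boolean isLastAuthenticatedForClient) {
            boolean orderedSinceBase = orderedInBatch != null && orderedInBatch >= baseCheckpointBatch;
            boolean pending = orderedInBatch == null;
            return orderedSinceBase || pending || isLastAuthenticatedForClient;
        }
    }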
In most cases the last request authenticated for a client c is either pending or ordered since the base checkpoint; the exception to this rule occurs when a client has been inactive for an extended period of time. The primary challenge in garbage collecting the state of the authentication stage is connected to the maintenance of the second type of requests—requests that have been authenticated but not yet ordered when it is time to perform garbage collection.

The authentication stage maintains three tables in transient memory. The first table, lastSent, is indexed by client identifier c and contains the last authenticated request sent to the order stage on behalf of that client. The second table, pending, contains up to one tuple ⟨c, nc, op⟩ per client and identifies the body of any request authenticated but not yet ordered for client c. When the authentication stage authenticates a client request, it adds the body to the pending table and adds the authenticated request message sent to the order stage to the lastSent table. The third table, commandCache, stores one tuple ⟨c, nc, op, no⟩ per request ordered since the current order-stage base checkpoint. When the authentication stage learns that request nc issued by client c is ordered in batch no, it moves the request body from the pending table into the commandCache.

The commandCache is implemented as a set of three distinct tables commandCache{0,1,2}. Request bodies ordered in batch no are stored in commandCachei where i = ⌊no /CP interval⌋ mod 3. Note that the authentication stage effectively maintains 3 checkpoint intervals worth of requests, in contrast to the 2 checkpoint intervals worth of batches maintained by the order stage. Our experience indicates that a slow execution replica is more likely to successfully catch up following a transient crash when the authentication stage caches 3, rather than 2, checkpoint intervals worth of requests. This benefit results from a race condition between the replica successfully fetching the appropriate execution stage checkpoint and the occurrence of the next garbage collection.

Garbage collection and transient crash recovery. When the authentication stage learns that batch no has been ordered, it can safely garbage collect any request bodies ordered in batches prior to the order-stage base checkpoint. We take a very simple approach to garbage collection. The authentication stage keeps track of the identifier no for the maximal ordered batch that it has observed. The first time it learns that batch n′o has been ordered, where ⌊n′o /CP interval⌋ > ⌊no /CP interval⌋, it garbage collects commandCachei where i = ⌊n′o /CP interval⌋ mod 3.

In order to survive transient crashes, authenticated requests must be stored in persistent memory. To that end, the first time a request nc from client c is authenticated, it is stored to a persistent log of authenticated requests before the authenticated request is sent to the order stage. Garbage collecting the log can be difficult because there may be very little correlation between when a request is authenticated and the ordered batch that it eventually appears in.
We consequently maintain a log of authenticated request bodies in a set of three distinct log files organized as a circular buffer: "authentication log.{0,1,2}." As requests are authenticated and placed in the pending and lastSent tables, their bodies are recorded into the currently active log file "authentication log.i" where i = ⌊no /CP interval⌋ mod 3 and no is the maximal batch identifier that the authentication stage has observed. Note that this log corresponds to the most recently updated commandCachei and not necessarily the commandCache where the request will eventually be placed.

The authentication stage switches from "authentication log.i" to "authentication log.j" when it garbage collects commandCachej. At that point, the authentication stage closes "authentication log.i" and clears the contents of "authentication log.j." It then dumps the base sequence number of the current checkpoint interval (i.e. ⌊no /CP interval⌋ × CP interval) and the contents of the pending table to "authentication log.j"—ensuring that any request placed in commandCachej is also present in "authentication log.j." After logging the pending table, the authentication stage resumes processing client requests, recording request bodies to the log file as they are added to the pending table. (A minimal sketch of this rotation appears after the modified properties below.)

Following a transient crash, the authentication stage reconstructs the pending, commandCache, and lastSent tables from the log files. For each client c, the logged request with maximal nc is recorded as the entry for pending[c] and a corresponding authenticated request message is placed in lastSent[c]. If there is no entry for a client, then lastSent[c] is left empty. Every request body recorded in file "authentication log.i" is added to commandCachei, with the exception of requests stored in the pending table. Note that requests initially authenticated and recorded in "authentication log.i" may be placed in a different commandCachej when they are finally ordered. By recording the pending table to the beginning of each log file, we ensure that any request that is ordered during the interval covered by commandCachej is present in "authentication log.j." Given a base sequence number base recorded at the beginning of "authentication log.j," the authentication stage does not garbage collect commandCachej until after it learns that a batch with identifier no ≥ base + 3 × CP interval has been ordered.

Additional properties. Garbage collecting outdated request bodies slightly modifies the set of requests that are fetchable by the execution stage:

AS2 Every authenticated request referenced by a batch ordered since the base checkpoint at the order stage or not yet ordered is fetchable.

We also modify the primary liveness property to ensure increasing client request identifiers or the retransmission of any pending requests.

AL1a If the authentication stage receives a request nc issued by correct client c and there is no pending request n′′c < nc, then request n′c ≥ nc is authenticated and sent to the order stage.

AL1b If the authentication stage receives a request nc issued by correct client c and there is a pending request n′c, then request n′c is authenticated and sent to the order stage.
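The following sketch illustrates the log switch described above: the reclaimed slot is cleared, the base batch identifier and the pending table are written first, and new request bodies are then appended as they are authenticated. The file names follow the text (written here with underscores); everything else, including the serialization format, is an assumption of ours rather than the UpRight implementation.

    import java.io.IOException;
    import java.io.ObjectOutputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.HashMap;

    // Sketch of switching to a new slot in the three-file circular authentication log.
    public final class AuthLogRotation {
        static final long CP_INTERVAL = 100;   // example value

        static int slot(long batchId) {
            return (int) ((batchId / CP_INTERVAL) % 3);
        }

        static Path switchTo(long maxOrderedBatch, HashMap<String, byte[]> pending) throws IOException {
            int j = slot(maxOrderedBatch);
            Path log = Path.of("authentication_log." + j);
            Files.deleteIfExists(log);                               // clear the reclaimed slot
            try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(log))) {
                out.writeLong((maxOrderedBatch / CP_INTERVAL) * CP_INTERVAL); // base sequence number
                out.writeObject(pending);                            // re-log all still-pending requests
            }
            return log;                                              // subsequent bodies are appended here
        }
    }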
We also modify two liveness properties at the execution stage to ensure that the authentication stage eventually learns that requests have been ordered:

EL1a If the execution stage receives ordered batch no and the last batch it has delivered to the application is n′o < no, then it fetches the relevant request bodies from the authentication stage and notifies the authentication stage that the contained requests have been ordered.

EL3 If the execution stage receives a retransmission instruction for request nc from c in batch no and the last batch executed by the execution stage is ne > no, then the execution stage resends the response to the most recent request n′c ≥ nc executed for client c and notifies the authentication stage that n′c has been ordered no later than batch ne.

Fetching a request body implicitly notifies the authentication stage that the request has been ordered; we simply make that implicit knowledge explicit. Retransmission requests received by the execution stage can occur not only because the client failed to receive a response, but also because the authentication stage did not receive the notification that a pending request has been successfully ordered.

Rate limiting. Faulty clients can issue an arbitrary number of requests and inflate the size of the pending request table. With the goals of robust fault tolerance (Chapter 3) in mind, the authentication stage maintains at most one pending request per client and authenticates at most one request per client per request identifier.

AS3 At most one request per identifier nc per authorized client c is authenticated.

AS4 When request nc from client c is authenticated, no request n′c > nc has been authenticated and there is no pending request n′c < nc.

In short, we ensure that there is at most one outstanding request per client waiting to be ordered. Note that we don't require the authentication stage to increment client request identifiers by one even though correct clients obey that restriction. This decision is a nod to the implications of a replicated authentication stage and the reality, discussed in Chapter 6, that client requests can be processed by the system without being processed by every authentication replica.

5.3.4 Client

A client intended to survive transient crashes must store its most recent request in persistent memory. If it does not store the request in persistent memory, then it may not be able to resume operation following a transient crash because it may be unable to recreate a request nc that was authenticated but not ordered—the authentication stage will reject the request according to AS1.

CS3 Client c stores the most recently issued request in persistent memory.

While it is easy to imagine techniques where the client sends a special "I'm starting over" message after a transient crash, the internal details of how the authentication stage is replicated complicate matters. Specifically, it is possible for a client request to reach only a subset of the authentication replicas before the client suffers the transient crash. Because each authentication replica individually authorizes at most one request per client c per request identifier nc, this state divergence can prevent any subsequent requests issued by client c from being authenticated.

5.4 Full property list

In this section we consolidate the stage properties described in Sections 5.1, 5.2, and 5.3 into one location.
These properties, and the pseudo-code descriptions in Appendix ??, form the basis of the replicated stage implementations discussed in Chapter 6. 5.4.1 Client Properties A correct client issues one request per request identifier, consumes request identifiers sequentially, and resends each request until it receives a response. The complete set of properties provided by a correct client follow: 89 CS1 Each request issued by client c is assigned a unique request identifier nc starting with 1 and increasing with each subsequent request. CS2 Client c operates in a closed loop: it does not issue request nc > 1 unless it has received a response to request nc − 1. CS3 Client c stores the most recently issued request in persistent memory. CL1 Client c resends request nc until it receives a response. 5.4.2 Authentication stage properties The authentication stage authenticates requests to the order stage and caches request bodies as long as they may be required by the execution stage. The authentication stage ensures that only requests from authorized clients are authenticated. The complete set of properties provided by the authentication stage follow: AS1 Only requests issued by authorized clients are authenticated and every authenticated request is one-step transferable. AS2 Every authenticated request referenced by a batch ordered since the base checkpoint at the order stage or not yet ordered is fetchable. AS3 At most one request per identifier nc per authorized client c is authenticated. AS4 When request nc from client c is authenticated, no request n′c > nc has been authenticated and there is no pending request n′c < nc . AL1a If the authentication stage receives a request nc issued by correct client c and there is no pending request n′′c < nc , then request n′c ≥ nc is authenticated and sent to the order stage. AL1b If the authentication stage receives a request nc issued by correct client c and there is a pending request n′c , then request n′c is authenticated and sent to the order stage. AL2 If the authentication stage receives a fetch message from the execution stage for a authenticated request nc issued by client c, then the authentication stage responds with the request body. 90 5.4.3 Order stage properties The order stage places authenticated requests into batches and assigns an execution order to batches. Each distinct request is placed in at most one batch; if it receives a request multiple times then the order stage requests a retransmission of the result rather than ordering the request for execution multiple times. The complete set of properties provided by the order stage follows: OS1 Only fetchable client requests authenticated by the authentication stage are placed into batches, and request nc issued by client c is placed in at most one batch. OS2 Batches contain one or more requests and are assigned monotonically increasing batch identifiers no starting with 1 and increasing by 1 with each subsequent batch. For batches no and n′o with associated times t and t′ , no > n′o → t > t′ . OS3 If request nc > 1 issued by client c is in batch no , then request nc − 1 issued by client c is in batch n′o < no . OS4 Order stage always has stable checkpoint at no , where no %CP interval = 0, and CP interval ≤ i ≤ 2 × CP interval subsequent ordered batches. OL1 If the order stage receives unordered authenticated request nc issued by correct client c, then the order stage places the request in batch no and eventually sends a next-batch message containing no to the execution stage. 
OL2 If the order stage receives an authenticated request nc from client c that is already in batch no , then it instructs the execution stage to retransmit a response to request n′c from client c in batch n′o where n′c ≥ nc and n′o ≥ no . OL3 If the execution stage requests all batches after ne and the order stage has ordered batches through no > ne and ne + 1 ≥ nCP , then the order stage resends all ordered batches from ne through no . OL4 If the execution stage requests all batches after ne and the order stage has ordered batches through no > ne and ne + 1 < nCP , then the order stage instructs the execution stage to load execution checkpoint nCP . 91 5.4.4 Execution stage properties The execution stage processes batches of requests in the order specified by the order stage. For each processed request, it delivers the result to the client that issued the request. The complete set of properties provided by the execution stage follows: ES1 Batch no is only delivered to the application if the last batch delivered to the application is no − 1. ES2 Only ordered batches are delivered to the application. ES3 Only responses generated by the application are cached or sent to clients. ES4 Execution stage maintains the execution checkpoint referenced by the order -stage base checkpoint in persistent memory. ES5 The execution stage has deterministic and replayable execution of ordered batches. EL1a If the execution stage receives ordered batch no and the last batch it has delivered to the application is n′o < no , then it fetches the request bodies for requests in batch no from the authentication stage and notifies the authentication stage that the contained requests have been ordered. EL1b If the execution stage has all of the request bodies for batch no and the last batch it delivered to the application is no − 1, then the execution stage delivers batch no to the application. EL2 If the execution stage receives a response from the application, then it stores the response for retransmission and sends the response to the responsible client. EL3 If the execution stage receives a retransmission instruction for request nc from c in batch no and the last batch executed by the execution stage is ne > no , then the execution stage resends the response to the most recent request n′c ≥ nc executed for client c and notifies the authentication stage that n′c has been ordered no later than batch ne . 92 EL4 If the execution stage receives a retransmission instruction for request nc from client c in batch no and the last batch executed by the execution stage is ne < no , then the execution stage informs the order stage that it has missed the batches since ne . EL5 If the execution stage receives an instruction to load checkpoint ne from the order stage, then it loads execution checkpoint ne . 5.5 Supported optimizations We support three operation paths in addition to the basic protocol operation described in the previous sections: (a) request pre-fetching between the authentication and execution stages, (b) read-only request execution, and (c) spontaneous server generated replies. Request prefetching. As a performance optimization, the UpRight library sup- ports request pre-fetching between the authentication and execution stages. When request pre-fetching is enabled, the authentication stage sends the request to the execution stage when it sends the authenticated request hash to the order stage. 
Request pre-fetching reduces the latency to process requests by ordering and distributing requests in parallel rather than sequentially.

Read-only replies. As a performance optimization, the UpRight library supports PBFT's read-only optimization [18], in which a client sends read-only, side-effect-free requests directly to the execution stage and the execution stage processes them without ordering them in the global sequence of requests. If the client receives a response, the client can use the reply; otherwise the request is concurrent with an interfering operation, and the client must reissue the request via the normal path to execute the request in the global sequence of requests. To support this optimization, the client and execution stage must identify read-only requests.

Small requests. As a performance optimization, the UpRight library does not order hashes of "small" requests (i.e. requests of less than 100 B) and instead orders the requests themselves. Small requests are placed directly into the verified request messages sent to the order stage. When the execution stage receives a batch containing a small request, the request can be executed directly without being fetched from the authentication stage.

Spontaneous replies. Replication libraries are designed around an implicit assumption that all client-server communication follows a simple pattern: clients issue requests and servers generate an immediate response for each processed request. In reality, not all interactions follow this pattern. Specifically, in some systems (a) every client request may not elicit a response from the server or (b) the server can push a response to a client without being prodded to do so by a specific request. In the first case, it is straightforward to have the server send a null response back to the client to discharge the obligations of the request-response pattern. The latter case is more difficult, as it is difficult to force the client to issue a request for a response it may not be expecting.

The UpRight library provides unreliable channels for push events. We posit that most client-server systems that rely on push events already cope with the "lost message" case (e.g., to handle the case when the TCP connection is lost and some events occur before the connection to the server can be reestablished), so existing application semantics are preserved. In our implementation, the execution stage includes sequence numbers on push events, sends them in FIFO order, and attempts to resend them until they are acknowledged, but can unilaterally garbage collect any pending push events at any time. When push events are lost, the client signals the (presumed existing) application lost-message or lost-connection handler.

5.6 Messages and notation

Messages exchanged among the client and the three server stages are shown in Table 5.1. We augment the message structure and fields with the identity of the stage that sends and receives the message. We expand on the use and meaning of each message in subsequent sections. Details on the byte definitions of these messages can be found in Appendix A.2.

There is a significant amount of notation introduced in these message definitions. We explain that notation in Table 5.2 and the following text. We use c to indicate a client.
    Message                                               Sent by          Received by
    ⟨client-req, ⟨req-core, c, nc, op⟩, c⟩μ⃗c,F             client           authentication
    ⟨auth-req, ⟨req-core, c, nc, hash(op)⟩μ⃗f,O, f⟩μ⃗f,O     authentication   order
    ⟨command, no, c, nc, op, f⟩μf,e                        authentication   execution
    ⟨toCache, c, nc, op, f⟩μ⃗f,E                            authentication   execution
    ⟨next-batch, v, no, H, B, t, bool, o⟩μ⃗o,E              order            execution
    ⟨request-cp, no, o⟩μ⃗o,E                                order            execution
    ⟨retransmit, c, no, o⟩μ⃗o,E                             order            execution
    ⟨load-cp, Tcp, no, o⟩μo,e                              order            execution
    ⟨batch-complete, v, no, C, e⟩μ⃗e,F                      execution        authentication
    ⟨fetch, no, c, nc, hash(op), e⟩μ⃗e,F                    execution        authentication
    ⟨cp-up, no, C, e⟩μ⃗e,F                                  execution        authentication
    ⟨last-exec, ne, e⟩μ⃗e,O                                 execution        order
    ⟨cp-token, no, Tcp, e⟩μ⃗e,O                             execution        order
    ⟨cp-loaded, no, e⟩μ⃗e,O                                 execution        order
    ⟨reply, nc, R, H, e⟩μe,c                               execution        client

Table 5.1: Message specification for messages exchanged between stages. The sender and recipients of the messages are indicated.

    Notation   Meaning
    c          Client identifier
    op         Client command
    nc         Client request identifier
    R          Result of processing client command
    f          Authentication replica identifier
    F          Authentication stage
    o          Order replica identifier
    O          Order stage
    p          "Primary" order replica
    e          Execution replica identifier
    E          Execution stage
    B          Batch of client requests
    C          List of client request identifiers
    no         Batch sequence number
    H          History of ordered batches
    ne         Sequence number of last executed batch
    Tcp        Execution stage checkpoint
    μi,j       MAC from replica i to replica j
    μ⃗o,E       MAC authenticator from replica o to stage E
    μ⃗F,O       Matrix signature from stage F to stage O

Table 5.2: Summary of symbols used and their meanings.

Each time client c issues a command op, it binds the command to a unique identifier nc. We differentiate between stages and replicas. A stage refers to the collection of replicas that work together to provide the authentication, order, and execution abstractions. We use f to refer to individual authentication replicas and F to refer to the authentication stage; o refers to a single order replica and O refers to the order stage; additionally, p refers to a designated order replica also called the primary; e refers to a single execution replica and E refers to the execution stage.

The order stage collates multiple client requests into a batch B. A batch consists of sets of tuples ⟨c, nc, op⟩ and is associated with a non-determinism unit t consisting of the system time and a pseudo-random seed. (Note that we explicitly separate the time and pseudo-random seed from the batch as part of the implementation; the time and pseudo-random seed are logically a part of the batch of requests.) Batches are assigned a unique sequence number no by the order stage. The history H of batches records the sequence of ordered batches, including the time and PRNG seed associated with each batch. The history at batch n is computed as Hn = hash(Hn−1, Bn, tn). We use the history to tie successive batches together. The execution stage reports a sequence number as ne and periodically reports checkpoints to the order stage through a checkpoint token Tcp. The set C is composed of tuples ⟨c, nc⟩ corresponding to the last request identifier nc executed by the execution stage for each client c.

We generically use hash(B) to indicate a hash of B and μi,j to indicate a MAC authenticated by replica i for verification by replica j. We use μ⃗o,E to indicate a MAC authenticator generated by replica o for verification by every execution replica and μ⃗F,O to indicate a matrix signature [3] generated by the replicas in F for authentication by every replica in O.
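A MAC authenticator is simply one MAC per intended recipient, each computed with the pairwise key shared between the sender and that replica. The sketch below illustrates the μ⃗ notation only; key management, message framing, and the matrix-signature construction are omitted, and all names are ours rather than the library's wire format.

    import javax.crypto.Mac;
    import javax.crypto.spec.SecretKeySpec;
    import java.nio.charset.StandardCharsets;

    // Illustrative MAC authenticator: one HMAC entry per receiving replica.
    public final class MacAuthenticator {

        static byte[] mac(byte[] pairwiseKey, byte[] message) throws Exception {
            Mac hmac = Mac.getInstance("HmacSHA256");
            hmac.init(new SecretKeySpec(pairwiseKey, "HmacSHA256"));
            return hmac.doFinal(message);
        }

        // e.g. μ⃗o,E: order replica o attaches one entry per execution replica.
        static byte[][] authenticator(byte[][] pairwiseKeys, byte[] message) throws Exception {
            byte[][] entries = new byte[pairwiseKeys.length][];
            for (int i = 0; i < pairwiseKeys.length; i++) {
                entries[i] = mac(pairwiseKeys[i], message);
            }
            return entries;
        }

        public static void main(String[] args) throws Exception {
            byte[][] keys = { "key-e0".getBytes(StandardCharsets.UTF_8),
                              "key-e1".getBytes(StandardCharsets.UTF_8) };
            byte[][] auth = authenticator(keys, "next-batch".getBytes(StandardCharsets.UTF_8));
            System.out.println("authenticator entries: " + auth.length);
        }
    }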
5.7 Stage level pseudo-code

In this section we provide pseudo-code for correct clients and the authentication, order, and execution stages. This pseudo-code describes the basic operation for each stage and provides the foundation that the replicated stages discussed in Chapter 6 will emulate. The pseudo-code presented in this chapter implements the stage and client properties listed in Section 5.4 and is presented here for completeness and concreteness. Most readers will want to skip this section.

    nc := 0     \\ next request ID
    out := ∅    \\ outstanding client request

    issueCommand(op):
        if out ≠ ∅ then block
        out := ⟨client-req, ⟨req-core, c, nc, op⟩, c⟩μ⃗c,F
        store out to persistent memory
        send out to F
        start timer

    on rcv m = ⟨reply, nc, R, H, e⟩μe,c:
        if m.nc = nc then
            nc := nc + 1
            deliver R to user
            clear timer

    on timeout:
        if out ≠ ∅ then send out to F

    on recovery:
        load out := ⟨client-req, ⟨req-core, c, nc, op⟩, c⟩μ⃗c,F from persistent memory
        nc := out.nc
        send out to F
        start timer

Figure 5.6: Pseudo-Code for the client

5.7.1 Client operation

Client operation is straightforward. The client issues commands, but before sending the command to the authentication stage it stores the client request message containing the command and the current client request identifier nc to persistent memory. It continues resending the client request message until it receives a response. Pseudo-code for the client is shown in Figure 5.6.

In our implementation, clients use an adaptive retransmission policy. When the system starts, the timeout is initially set to 500 ms. Each time a retransmission is required, the timeout is doubled up to a maximum of 4000 ms. When a response is received, the base timeout is set to the maximum of 500 ms and the observed latency for the previous request-response pair.

5.7.2 Authentication operation

The authentication stage is responsible for authenticating client requests and caching the request bodies so that they can be fetched by the execution stage. Pseudo-code for the authentication stage is shown in Figure 5.7. We describe the operation of the authentication stage by detailing the state maintained and the processing of each of the three messages it receives from other participants in the system.

Data structures. The authentication stage maintains three data structures: (1) a set lastSent indexed by client identifiers that stores the last validated request message ⟨auth-req, ⟨req-core, c, nc, hash(op)⟩μ⃗f,O, f⟩μ⃗f,O sent to the order stage for each client, (2) a set pending indexed by client identifiers that stores any requests that have been validated but not yet ordered for each client, and (3) a set commandCache indexed by a client identifier/client request identifier tuple that stores bodies of validated requests and the batch identifier for the batch containing that request. The lastSent set aids in processing retransmitted requests. The pending and commandCache sets are fundamental in ensuring that any authenticated request body is fetchable.
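The three sets map naturally onto simple maps. The Java rendering below is only illustrative (the types and field names are ours; the pseudo-code in Figure 5.7 is the authoritative description).

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative rendering of the authentication-stage data structures.
    public final class AuthStageState {
        static final long CP_INTERVAL = 100;   // example value

        static final class Request { long clientId; long requestId; byte[] body; }
        static final class AuthReq { long clientId; long requestId; byte[] bodyHash; }

        // last validated request message sent to the order stage, per client
        final Map<Long, AuthReq> lastSent = new HashMap<>();
        // validated but not-yet-ordered request, per client
        final Map<Long, Request> pending = new HashMap<>();
        // bodies of ordered requests, split into three interval-sized caches,
        // keyed here by "clientId:requestId"
        final List<Map<String, Request>> commandCache =
                List.of(new HashMap<>(), new HashMap<>(), new HashMap<>());

        // commandCache slot for a request ordered in batch no
        int cacheSlot(long batchId) {
            return (int) ((batchId / CP_INTERVAL) % 3);
        }
    }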
In addition to the three sets described above, the authentication stage maintains a batch identifier nf that represents the next batch identifier that the authentication expects to see ordered. Processing hclient-req, hreq-core, c, nc , opi, ciµ~ c,F . The primary task of the authentication stage is authenticating client requests. When the authentication stage receives a request from client c, it first checks if a request from c with request identifier n′c ≥ nc has already been authenticated by examining the contents of lastSent[c]. If request n′c ≥ nc has been authenticated, then the authentication stage resends the authenticated request message stored in lastSent[c] to the order stage. If no request n′c ≥ nc has been authenticated for client c then the authentication stage confirms that the last request authenticated for c has been successfully ordered by checking pending[c]. If pending[c] is not empty, the the authentication stage discards the client request message, otherwise it authenticates the request and generates a new authenticated request message hauth-req, hreq-core, c, nc , hash(op)iµ~ f,O , f iµ~ f,O . This message is stored to lastSent[c] while the tuple hc, nc , opi is stored to persistent memory and pending[c]. The authentication stage processes new requests for c as they are received but limits retransmissions to occur at most once per 4000ms per client request. 98 1 2 lastSent[c] := ∅ \\ l a s t client 4 5 pending[c] := ∅ \\ r e q u e s t 7 8 9 commandCache{0,1,2} := ∅ \\ s e t o f o r d e e d commands i n d e x e d by c l i e n t c l i e n t request i d e n t i f i e r request validated for v a l i d a t e d , but n o t known to be o r d e r e d , p e r c l i e n t 11 base{0,1,2} := 0 \\ batch identifier 13 nf := 0 \\ e x p e c t e d n e x t b a t c h identifier 14 i := CP f interval 16 on rcv m = hclient-req, hreq-core, c, nc , opi, ciµ ~ n initial each mod 3 \\ l o g file index c,F : i f m.nc ≤ lastSent[c].nc then send lastSent[c] to O e l s e i f pending[c] = ∅ then lastSent[c] := hauth-req, hreq-core, c, nc , hash(op)iµ ~ 21 22 23 pending[c] := hc, nc , opi append pending[c] to a u t h e n t i c a t i o n l o g . i send lastSent[c] to O 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 42 43 44 45 46 49 50 51 52 53 54 55 56 57 58 and f o r commandCachei 17 18 19 20 25 identifier on rcv m = hbatch-complete, v, no , C, eiµ ~ e,F f,O , f iµ ~ f,O : i f m.no ≥ nf then if nf −1 < CPm.no CP interval interval mod 3 i := CPm.no interval ∧ no > basei + 3 × CP interval then g a r b a g e c o l l e c t commandCachei clear authentication log .i × CP interval to a u t h e n t i c a t i o n l o g . i append CPm.no interval ∀c do i f pending[c] 6= ∅ then append pending[c] to t h e a u t h e n t i c a t i o n l o g . i nf := m.no + 1 ∀b = hc, nc i ∈ C do i f b.nc ≥ lastSent[b.c].nc then i f pending[b.c].nc = b.nc then commandCachei .add(pending[b.c], no ) pending[b.c] := ∅ on rcv m = hfetch, no , c, nc , hash(op), eiµ ~ e,F : mod 3 k := CPm.no interval op := commandCachek .get(c, nc ).op i f hash(op) = hash(op) then send hcommand, no , c, nc , op, f iµf,e to e on r e c o v e r : ∀j ∈ {0, 1, 2} do basej := i n i t i a l s e q u e n c e number from a u t h e n t i c a t i o n l o g . j ∀m = hc, nc , opi ∈ a u t h e n t i c a t i o n l o g . 
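The text above notes that the authentication stage retransmits a cached auth-req for a client at most once per 4000 ms. A minimal sketch of such a per-client rate limit follows; the class and field names are ours and the check would be consulted before resending lastSent[c].

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative per-client retransmission rate limit (at most one resend per 4000 ms).
    public final class RetransmitLimiter {
        static final long MIN_GAP_MS = 4000;

        private final Map<Long, Long> lastRetransmit = new HashMap<>();   // client -> last send time

        boolean mayRetransmit(long clientId, long nowMs) {
            Long last = lastRetransmit.get(clientId);
            if (last != null && nowMs - last < MIN_GAP_MS) {
                return false;                       // too soon; drop the duplicate request
            }
            lastRetransmit.put(clientId, nowMs);
            return true;
        }
    }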
j do i f pending[m.c] = ∅ ∨ pending[m.c].nc < m.nc then commandCachej .add(pending[m.c]) pending[m.c] = m lastSent[m.c] := hauth-req, hreq-core, m.c, m.nc , hash(m.op)iµ , f iµ ~ ~ f,O f,O else commandCachej .add(m) Figure 5.7: Pseudo-Code for the authentication stage to follow. 99 Processing hbatch-complete, v, no , C, eiµ~ e,F . The batch completed message is used to notify the authentication stage that the requests in the specified batch have been ordered. When the authentication stage receives a batch completed message it checks to see if it is safe to perform any garbage collection, performs any relevant garbage collection, and then performs the semantic processing of the message. Upon receipt of the batch completed message, the authentication stage compares the batch identifier no with the next batch it expects to be ordered nf . If no < nf , then the authentication stage proceeds directly to the semantic processing of the message. If no ≥ nf , then the authentication stage checks if a checkpoint interval boundary occurred between nf − 1 and no . If the authentication stage finds that a checkpoint boundary has occurred between nf −1 and no , i.e. nf −1 CP interval < no CP interval , then the authentication stage garbage collects the commandCache and transitions to a new log file. Independent of whether garbage collection is appropriate or not, the authentication stage updates nf to be no + 1. The semantic processing of batch completed messages is straightforward. The batch summary C contains one client id/request identifier tuple hc, nc i per request in the ordered batch no . For each request nc from client c is ordered in batch no , the authentication stage compares the identifier of the pending request for that client with nc . If the pending request identifier pending[c].nc = nc then the authentication stage moves the contents of pending[c] to commandCache[c, no ] and associates the request with batch identifier no . The pending request for c is then cleared as long as pending[c].nc ≤ nc . Note that the pseudo-code allows for gaps in the sequence of client request identifiers even though correct clients do not introduce gaps in their sequence of request identifiers. The abstract authentication stage should not allow for gaps in the sequence of client request identifiers, yet the pseudo-code and our description of the authentication stage does: why? The answer is simple: when the authentication stage is implemented by multiple replicas, it is impossible to ensure that all replicas receive every request without implementing protocol that requires more communication than is strictly necessary. Providing the mechanisms for handling gaps in the client request identifier sequence does not impact the behavior of the abstract (unreplicated) order stage in this discussion, but including those details now simplifies our detailed discussion of replicating the authentication stage in Chapter 6. 100 Processing hfetch, no , c, nc , hash(op), eiµ~ e,F . Processing the fetch body mes- sage is straightforward. Upon receipt of a fetch body message, the authentication stage pulls the request body stored in commandCache[c, nc . If the body is consistent with the request hash hash(op) in the message, then the authentication stage sends the body to the execution stage. Recovery. Following a transient failure, the authentication stage populates the pending and commandCache sets from the authentication log.* files. 
It populates the lastSent set with authenticated request messages corresponding to the contents of the pending set. Additionally, following recovery from a transient crash, the authentication stage explicitly delays garbage collection until it has observed ordered batches that span at least 4 checkpoint intervals. 5.7.3 Order operation The order stage is responsible for placing authenticated requests into batches and assigning an execution order to the batches. Pseudo-code for the order stage is shown in Figure 5.8. We describe the operation of the order stage by detailing the state maintained and the processing of each of the four messages it receives from other participants in the system. Data structures. The lastOrdered set records the client request identifier nc of the last request ordered for each client nc . The B is an incomplete batch of requests that has not yet been communicated to the execution stage. The cached log contains between CP interval and 2 × CP interval consecutive ordered batches. The log is broken up into two pieces; at any point in time either cached0 or cached1 contains CP interval consecutive ordered batches. In addition to the three sets described above, the order stage maintains the identifier no of the next batch to be ordered, the identifier of the current base checkpoint nCP , the current time (defined as the time associated with the previously ordered batch), and a binary index ind. Processing hauth-req, hreq-core, c, nc , hash(op)iµ~ f,O , f iµ~ f,O . The primary task of the order stage is creating batches of one or more authenticated request and as101 1 2 3 4 5 6 7 8 9 11 lastOrdered := ∅ \\ l a s t r e q u e s t i d e n t i f i e r o r d e r e d f o r e a c h c l i e n t cached{ 0, 1} := ∅ \\ l o g o f n e x t b a t c h m e s s a g e s i n d e x e d by o r d e r b a t c h B := ∅ \\ b a t c h o f r e q u e s t s no := 0 \\ b a t c h i d e n t i f i e r o f t h e n e x t b a t c h to be o r d e r e d nCP := 0 \\ b a t c h i d e n t i f i e r o f t h e b a s e c h e c k p o i n t baseCP := ∅ \\ b a s e c h e c k p o i n t secondaryCP := ∅ \\ s e c o n d a r y c h e c k p o i n t time := 0 \\ l a s t b a t c h t i m e ind := 0 \\ c u r r e n t l o g i n d e x on rcv hauth-req, hreq-core, c, nc , hash(op)iµ ~ f,O , f iµ ~ f,O : 12 13 i f lastOrdered[c] ≥ nc then send hretransmit, c, no , oiµ ~ 14 15 16 17 18 19 20 21 return i f lastOrdered[c] + 1 = nc then lastOrdered[c] := nc B ∪=hc, nc , hash(op)i i f B i s f u l l then time = System.time t := htime, randomi cachedind [no ] := hnext-batch, v, no , H, B, t, bool, oiµ ~ o,E to E 22 append cachedind [no ] to o r d e r l o g . ind 24 25 26 27 28 29 30 31 32 33 34 send cachedind [no ] to E no := no + 1 i f no mod CP interval = 0 then i f 6 secondaryCP.isStable() then w a i t f o r secondaryCP.isStable() nCP := no − CP interval ind := (ind + 1) mod 2 g a r b a g e c o l l e c t cachedind and baseCP c l e a r o r d e r l o g . ind and o r d e r C P . ind baseCP := secondaryCP secondaryCP := t a k e o r d e r c h e c k p o i n t 36 37 38 39 40 41 42 43 44 on rcv m = hcp-token, v, no , Tcp ie µ ~ e,O : i f m.no 6= secondaryCP.no then d i s c a r d m e s s a g e and r e t u r n i f secondaryCP.hasExecCP () then d i s c a r d m e s s a g e and r e t u r n secondaryCP.addExecCP (Tcp ) c l e a r o r d e r C P . ind w r i t e secondaryCP to o r d e r C P . 
ind secondaryCP.makeStable() 46 on rcv hlast-exec, ne , eiµ ~ 47 48 49 50 51 53 54 55 56 e,O identifier o,E : i f nCP ≤ ne < no then ∀i ∈ [ne , no ) send cached[i] to e i f ne < nCP then send hload-cp, Tcp , nCP , oiµo,e to e on r e c o v e r : l o a d cachedi from o r d e r l o g . i l o a d minimal c h e c k p o n i t i n o r d e r C P . 0 o r o r d e r C P . 1 a s b a s e c h e c k p o i n t r e p o p u l a t e r e m a i n i n g v a r i a b l e s by r e p l a y i n g c o n t e n t s o f cached0 and cached1 Figure 5.8: Pseudo-Code for the order stage to follow. 102 signing those batches an order. Upon receipt of a authenticated request, the order stage first checks if the request has already been placed in a batch; if the request has been placed in a batch, then the order stage instructs the execution stage to retransmit the last response to the issuing client c and returns. If the request has not yet been ordered and it corresponds to the next request in sequence for client c, then it is added to the pending batch of requests B and the lastOrdered record for c is updated to nc . If the batch B is sufficiently full, then the order stage sets the tuple t to contain the system time to be used when executing the batch of requests and a random PRNG seed. The order stage ensures that time increases with successive ordered batches, i.e. if n′o > no then t′ > t. The batch B and tuple t are placed in ordered batch message no . The next batch message is added to cachedind and appended to ordered log.ind before it is sent to the execution stage. The next batch identifier to be used is incremented by 1, no := no + 1 and the order stage waits for the next request to arrive. After incrementing the next batch identifier, the order stage checks to see if it has reached a checkpoint interval. If no mod CP interval = 0, then the order stage has arrived at a checkpoint interval and it is time for garbage collection. The order stage waits for the checkpoint at no −CP interval to become stable, at which point it (1) increments ind, (2) designates checkpoint no − CP interval as the new base checkpoint, (3) garbage collects cachedind and the old base checkpoint, (4) clears files associated with garbage collected state, and (5) generates a secondary checkpoint at no . Processing hcp-token, v, no , Tcp ie µ ~ e,O . The order stage receives CP messages from the execution stage that contain a token Tcp describing the execution checkpoint at no . Upon receipt of the execution checkpoint token, the order stage adds Tcp to the order checkpoint at no and stores the order checkpoint to order CP.ind. Processing hlast-exec, ne , eiµ~ e,O . The execution stage sends last executed mes- sages when it detects that the network is not behaving reliably due to dropping/delaying/reordering a subset of next batch messages. When the order stage receives a last executed message, it responds with the ordered batches with identifiers in the range ne to no exclusive. If the last batch executed by the execution stage ne is smaller than the base checkpoint maintained by the order stage baseCP , then the order stage instructs 103 the execution stage to load the execution checkpoint described by Tcp stored in order checkpoint baseCP . This scenario seems far fetched, but can occur when the execution stage suffers a transient crash. 
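A compact sketch of this last-executed handling is shown below in Python; the field and message names are illustrative, and the corresponding pseudo-code appears in Figure 5.8.

def on_last_exec(order, n_e):
    """Return the messages the order stage should resend to an execution
    replica that reports n_e as its last executed batch.

    order.n_cp: base checkpoint identifier; order.n_o: next batch to be
    ordered; order.cached: retained next-batch messages; order.exec_cp_token:
    the token T_cp stored in the base order checkpoint (names assumed)."""
    if order.n_cp <= n_e < order.n_o:
        # Behind, but still covered by the retained log: replay the
        # missing next-batch messages.
        return [order.cached[i] for i in range(n_e, order.n_o)]
    if n_e < order.n_cp:
        # Behind the oldest retained batch: instruct a checkpoint load.
        return [("load-cp", order.exec_cp_token, order.n_cp)]
    return []  # the replica is up to date; nothing to resend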
The scenario can also occur during normal operation when the execution stage is replicated because the network (and faulty order replicas) cannot be relied on to deliver messages to all order replicas in a timely fashion. Recovery. Following recovery from a transient crash, the order stage sets the base checkpoint to be the earliest checkpoint contained in order CP.0 or order CP.1. The order stage then reads the contents of order log.i into cachedi for i ∈ {0, 1} and updates lastOrdered to be consistent with the logged next batches as described in Section 5.3.1. 5.7.4 Execution operation The execution stage is responsible for delivering batched requests to the application in the linearized order specified by the order stage and relaying the response to each request to the client that issued the request. Pseudo-code for the execution stage is shown in Figure 5.9. We describe the operation of the execution stage by detailing the state maintained and the processing of each of the four messages it receives from other participants in the system. Data structures. The execution stage maintains a replyCache of the most recent response sent to each client. It also maintains a collection of sets of batchCommands. Each set in batchCommands corresponds to the set of request bodies specified for an ordered batch and is augmented by the designated system time and PRNG seed. The execution stage additionally maintains an identifier ne of the next batch to be executed. Processing hnext-batch, v, no , H, B, t, bool, oiµ~ o,E . When the execution stage receives an ordered batch, it first compares the batch identifier no with the identifier ne of the next unexecuted batch in the sequence. If no < ne then the execution stage discard the next batch message. If no ≥ ne , then it notifies the authentication stage that the batch is complete and fetches the bodies of all requests contained in the batch. 104 1 2 3 4 ne := 0 \\ i d e n t i f i e r o f t h e n e x t b a t c h to e x e c u t e replyCache := ∅ \\ l a s t r e p l y s e n t to e a c h c l i e n t batchCommands := ∅ \\ p e r b a t c h s e t o f r e q u e s t b o d i e s , state := ∅ \\ a p p l i c a t i o n s t a t e 6 on rcv m = hnext-batch, v, no , H, B, t, bool, oiµ ~ 7 8 9 10 11 12 13 14 15 16 17 o,E : e,F ∀hc, nc hash(op)i ∈ B send hfetch, no , c, nc , hash(op), eiµ ~ e,F to F on rcv hcommand, no , c, nc , op, f iµf,e : batchCommands[no ].add(c, nc , op) w h i l e batchCommands[ne ].isComplete do hstate, responsesi := app.exec(state, C[ne ]) batchCommands[ne ] := ∅ ∀r = hc, nc , Ri ∈ responses replyCache[r.c] := hreply, r.nc , r.R, H, e, iµe,r.c send replyCache[r.c] to r.c ne := ne + 1 i f ne mod CP interval = 0 then CPapp := app.takeCP (state) CPexec := t a k e e x e c u t i o n c h e c k p o i n t ne CPexec .setAppCP (CPapp ) r e c o r d CPexec to exec CP . ne Tcp := hash(CPexec ) ∀i ≤ ne − 2 × CP interval : ∃ exec CP . { ne } do d e l e t e exec CP . 
ne send hcp-token, ne , Tcp , eiµ ~ e,O to O 38 on rcv hretransmit, c, no , oiµ ~ o,E : 39 40 i f no ≥ ne then send hlast-exec, ne , eiµ ~ 41 42 43 44 45 46 47 i f no + 1 = ne then ), Tcp , eiµ send hcp-token, CP interval × ( CP ne ~ e,O interval i f no < ne then C := ∅ ∀m = hreply, nc , R, H, e, iµe,c ∈ replyCache do C ∪=hm.c, m.nc i send hbatch-complete, v, ne − 1, C, eiµ to F ~ 48 send replyCache[c] to c 50 51 52 53 54 55 identifier i f m.no < ne then d i s c a r d and r e t u r n e l s e i f m.no ≥ ne then batchCommands[m.no ].setT imeAndP RN G(t) batchCommands[m.no ].setCommands(B) C := ∅ ∀b = hc, nc , hash(op)i ∈ B do batchCommands ∪=hb.c, b.nc i to F send hbatch-complete, m.v, m.no , C, m.eiµ ~ 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 19 k e y e d by b a t c h e,O to O to O e,F on rcv hload-cp, Tcp , no , oiµo,e : CPexec := l o a d c h e c k p o i n t from exec CP . no i f hash(CPexec ) = Tcp then l o a d CPexec state := app.loadCP (CPexec .getAppCP ()) s e t ne = no Figure 5.9: Pseudo-Code for the execution node to follow. 105 Note that notifying the authentication stage that the batch is complete and fetching request bodies from the authentication stage are actions based on sending two distinct messages. Using two distinct messages is unnecessary at the intra-stage level of the protocol, but becomes an important concern when the authentication and execution stages are replicated—while it is certainly sufficient for every authentication replica to send every request body to every execution replica, it is not necessary. Using the two distinct messages, one to notify the authentication stage that requests have been ordered and the other to explicitly fetch the requests allows us to limit the number of times a request body is sent over the network. We discuss these concerns in more detail in Chapter 6. Processing hcommand, no , c, nc , op, f iµf,e . Upon receipt of a request body mes- sage, the execution stage adds the body to the set of bodies it has gathered for batch no . Batch no is complete if the execution stage has bodies for every request in the batch. After adding a body to the batch, the execution stage checks if batch ne , the first batch it has not yet executed, is complete. If batch ne is complete, then the execution stage delivers the batch (including time and PRNG) to the application, stores the responses in the replyCache and sends the responses to the appropriate clients before incrementing ne by one. The authentication stage continues this process until batch ne is not complete, either because the execution stage is missing one or more request bodies or because the execution stage has not yet received the ordered batch ne from the order stage. Before executing batch ne %CP interval , the execution stage takes a checkpoint of the execution state, records the checkpoint to persistent memory, and sends a token Tcp describing the checkpoint to the order stage. Processing hretransmit, c, no , oiµ~ o,E . The retransmission message contains two important fields, the client c that needs a response and the next batch identifier no that will be used by the order stage. Upon receipt of a retransmission message, the execution stage sends client c the last response stored in the replyCache for c. 
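The remaining branches of the handler are described in the next paragraph; the following Python sketch (with illustrative names that are not the library's API) shows how they might fit together.

def on_retransmit(execution, client, n_o):
    """Handle a retransmission hint naming a client and the order stage's
    next batch identifier n_o; returns the messages to send.

    Field names (n_e, reply_cache, latest_cp_token) are assumed."""
    out = []
    cached_reply = execution.reply_cache.get(client)
    if cached_reply is not None:
        out.append(("reply", client, cached_reply))   # resend cached reply
    if n_o >= execution.n_e:
        # The order stage is ahead of us: report our last executed batch.
        out.append(("last-exec", execution.n_e))
    if n_o + 1 == execution.n_e:
        # We sit exactly at the order stage's frontier: resend the most
        # recent execution checkpoint token.
        out.append(("cp-token", execution.latest_cp_token))
    if n_o < execution.n_e:
        # We are ahead of the order stage's view: assert that every request
        # with a cached reply was ordered in batch n_e - 1 or earlier.
        completed = [(c, reply["n_c"])
                     for c, reply in execution.reply_cache.items()]
        out.append(("batch-complete", execution.n_e - 1, completed))
    return out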
Additionally, if no is at least the next unexecuted batch identifier ne then the execution stage notifies the order stage that it has not yet executed any batch with identifier n′o ≥ ne and waits for the appropriate batches to be retransmitted; if no + 1 = ne , then the execution stage sends the execution stage checkpoint to the 106 order stage; if no < ne then the execution stage sends a special batch completed message to the authentication stage. This message asserts that every request with a response in the replyCache is ordered in batch ne − 1 (or earlier). Processing hload-cp, Tcp , no , oiµo,e . When the execution stage receives a load checkpoint message, it loads the execution checkpoint described by Tcp . We implement the checkpoint token Tcp as a hash of the byte representation of the execution checkpoint. Recovery. The execution stage does not do anything to recover from transient crashes. It simply begins operation as if it is starting from a fresh slate and waits for messages form the order stage. 5.8 Conclusion This chapter describes the interactions between correct stages in the UpRight library. Because we assume the stages are correct, and not ideal, the interactions between correct stages accounts for the possibility that clients or the network may be faulty, the reality that network and storage resources are finite, and the threat of transient power outages. The interactions between stages define the properties that each stage must fulfill in order to reliably replicate an application. We discuss the challenges of replicating each stage to sustain the requisite properties despite failures in the next chapter. The replicas of each stage must be deployed on separate machines to receive the benefits of fault tolerant replication. Although the stages are described as logically separate entities, the replicas implementing each stage need not be physically separate. For example, one machine can host an authentication and order replica while another hosts an authentication and execution replica while yet another machine hosts only an execution replica. 107 Chapter 6 UpRight Replication Chapter 5 describes the interactions between the stages of the UpRight architecture assuming that each replicated stage provides the abstraction of a single correct machine. In this chapter we focus on the intra-stage protocols required to discharge the assumption that an individual stage is correct. We consider a replicated stage to be correct if it is up—i.e. ensures the liveness properties specified in Chapter 5— despite at most u arbitrary failures; and right—i.e. ensures the safety properties specified in Chapter 5—despite at most r commission failures. When discussing the replication of each stage, there are three key questions that we must address. How do the replicas within a stage coordinate with each other? How does replicating one stage impact the other stages? How many replicas are required to implement each stage? The answer to all three questions is closely tied to the challenge of solving asynchronous consensus. The combination of stages described in Chapter 5 can be viewed as a sequence of three consensus protocols. In the first instantiation of consensus, clients propose requests, the authentication accepts and authenticates the requests, and the order stage learns the authenticated requests. 
In the second instantiation of consensus, the authentication stage proposes authenticated requests, the order stage accepts and orders the requests in batches, and the execution stage learns the ordered request batches. In the third instantiation of consensus, the order stage proposes ordered batches of requests, the execution stage accepts the ordered batches and executes them in order, and the clients learn the results of executing requests in the sequence of ordered batches. 108 Stage Authentication Order Execution Replication requirements u + max{u, r} + r + 1 2u + r + 1 u + max{u, r} + 1 Table 6.1: Summary of stage-level replication requirements. We consequently base our design and implementation of each stage on consensus protocols. Although consensus is a well known and extensively studied problem, it is important to note that consensus protocols are not created equal. Both the replication requirements and the acceptor-acceptor and proposer-acceptor-learner communication are influenced by (a) the identity and number of proposers and learners, (b) the desired number of communication steps between proposing and learning, and (c) the semantics of the values being learned. While 2u + r + 1 (i.e., 3f + 1 when f = u = r) replicas are generally sufficient to solve consensus, there are interesting configurations (e.g., a single unfailing proposer) that require fewer acceptors (u + r + 1). Similarly, there are semantics (e.g., a proposed value must be one-step transferable and authenticated via MACs) that may require additional acceptors (u + 2r + 1) [32, 60, 56, 68]. Table 6.1 shows the replication requirements for the authentication, order, and execution stages. The order stage requires the standard 2u + r + 1 replicas. The execution stage requires fewer replicas—u + max{u, r} + 1 to be precise. This number is impacted by two considerations that will be explored in this chapter: (1) the order stage acts as a single unfailing proposer and (2) execution checkpoints require indirect learning. Indirect learning occurs any time a hash of data, rather than the data itself, is passed from one stage to another. The authentication stage instead requires u + max{u, r} + r + 1 replicas. The core replication requirements of the authentication stage are based on the same factors as the execution stage, with the additional requirement that learned values must be one-step transferable. In the rest of this chapter we expand on the design of each stage and the specifics of the consensus problem that each stage solves. In Section 6.1 we discuss relevant background on asynchronous consensus, paying specific attention to environments that do not require the standard 2u + r + 1 acceptors. In Section 6.2 we discuss the mapping of the order stage to consensus and thedetails of implementing a replicated order stage with 2u + r + 1 replicas. We begin the discussion of repli109 cated stages with the order stage because it is most similar to previous work. In Section 6.3 we discuss the mapping of the execution stage to consensus and details of implementing a replicated execution stage with u + max{u, r} + 1 replicas. In Section 6.4 we discuss the mapping of the authentication stage to consensus and the details of implementing a replicated authentication stage with u + max{u, r} + r + 1 replicas. In Section 6.5 we present microbenchmark experiments that explore the performance characteristics of our prototype implementation of the UpRight library. 
In Section 6.6 we discuss the costs and benefits of maintaining the logical separation when stages are replicated. 6.1 Consensus background Recall that we briefly introduced Paxos style consensus [53, 54, 56] in Chapter 2 and subsequently used consensus as a concrete foundation for discussing the UpRight failure model. To recap, the participants in a consensus protocol are divided into three categories based on their role in the system. Proposers propose values, acceptors accept proposed values, and learners learn accepted values. A consensus protocol is correct if its safety properties hold despite up to r commission failures and its liveness properties hold despite up to u total failures. The three consensus safety properties are: (1) only proposed values are accepted, (2) at most one value is accepted, and (3) non-faulty learners only learn accepted values. The single consensus liveness property is: if a non-faulty proposer proposes a value during a synchronous interval then non-faulty learners eventually learn a value. We highlight consensus at this point in the thesis for two reasons. First, because RSM protocols are traditionally based on consensus protocols, it is important to understand consensus before discussing the details of replication protocols. Second, the number of replicas required to implement consensus varies with (a) the configuration of proposers, acceptors, and learners, (b) the targeted communication steps, and (c) the semantics of the values being learned. Consensus and state machine replication. RSM protocols including PBFT [18], Paxos [53], Zyzzyva [49], and many others are traditionally built around repeatedly executing a consensus protocol that is used to order requests for processing. During 110 the ith instance of consensus, the replicas assign the sequence number i to a request. The replicas then execute the requests in the specified order and relay the result of executing the request to the client that issued the request before moving on to the i + 1st instance of consensus. When the client receives a response, it explicitly learns the result of executing its request as the ith request in the sequence and implicitly learns the relevant impact of the previous i − 1 requests. Consensus replication requirements. As mentioned in the previous section, the number of acceptors (i.e. replicas) required to implement consensus is not always 2u + r + 11 . While 2u + r + 1 replicas are generally sufficient to solve asynchronous consensus, this number can increase or decrease based on the specific semantics and configuration of replicas for that instance of consensus [32, 60, 56, 68] If there is exactly one proposer and that proposer cannot fail, then u + r + 1 replicas are sufficient to solve consensus [32, 60, 56]. The intuition for the reduced costs is straightforward. First, a single proposer that cannot fail can be trusted to send the same messages, in the same order and with appropriate message identifiers, to each acceptor. Second, a correct acceptor can be trusted to accept messages in the order they were sent by the proposer. Third, a learner that receives matching values from r + 1 distinct acceptors knows that at least one acceptor is non-faulty; it can then be confident that no other value will be accepted for that sequence number and proceed to learn the value. Note that this implies that r + 1 replicas are sufficient to ensure the safety properties. 
An additional u replicas are required to ensure that a value can be learned, i.e., that a quorum of r + 1 correct replicas exists, despite up to u total failures. The semantics of the value being learned can increase the replication requirements. Continuing to consider a single correct proposer, we explore two specific semantics of learned values—indirect learning, i.e. a hash of a value is learned rather than the value itself, and one-step transferable with MAC authentication, i.e. the initial learner can teach other learners a value authenticated with MACs. In the context of indirect learning, the learner can be sure that the hash is correct after receiving the value from r + 1 distinct acceptors but has no assurance that the actual value will be available for use in the future. If it is important that the value itself be fetchable in the future, then the learner must receive the hash from 1 Recall that 2u + r + 1 is equivalent to 3f + 1 when u = r = f . 111 u + 1 distinct acceptors. Combining these two concerns, the learner must receive the hash from max{u, r} + 1 distinct acceptors to be sure that the hash is correct and the underlying value is fetchable. Ensuring that indirect learning is always possible requires an additional u replicas for a total of u + max{u, r} + 1. One-step transferability requires any correct learner a that learns a value directly from the acceptors to teach correct learner b that same value. In order for a to teach a learned value to b, a must provide b with the value and sufficient proof for b to believe that the value was accepted. With a single correct proposer, a can learn a value when it receives the value from r + 1 distinct acceptors. If the messages are authenticated using public key cryptography, then a can pass the r + 1 signatures and the value to b and know that b, if correct, will also learn the value. If, however, messages are authenticated with MAC authenticators, then a cannot be sure that b will successfully authenticate all r + 1 MAC authenticators and recognize that the value has been accepted—some subset of the acceptors may be faulty. If a receives the value from 2r +1 replicas, on the other hand, it can provide the value and the set of 2r + 1 MAC authenticators to b and know that b will successfully authenticate at least r + 1 authenticators and subsequently learn the value. An additional u replicas are required to ensure that a can always teach a learned value to b2 . Note that matrix signatures [3] provide a general mechanism for implementing one-step transferability. The discussion here clarifies the relationship between consensus and matrix signatures. We initially developed matrix signatures in the traditional context of f Byzantine failures and observed that matrix signatures, like standard consensus, require 3f +1 replicas. We now see that matrix signatures solve a specialized consensus problem with a replication requirement of u + 2r + 1 replicas that differs from the traditional 2u + r + 1. 6.2 Replicated order stage The order stage is responsible for placing requests in batches and selecting a linearized batch order. The order stage also tracks the recent execution-stage checkpoints used as part of (1) keeping state at each stage finite and (2) recovering from 2 We note that one-step transferability is the core property provided by digital signatures in Chapter 3.3. 112 stage level transient crashes. We view both of these activities as instances of consensus, where the order stage acts as the acceptors. 
We refer to the first instance as normal-operation and the second as checkpoint-operation. During normal-operation, the authentication stage proposes authenticated requests to the order stage, the order stage accepts the requests by placing them into batches and assigning an order to the batches, and the execution replicas learn the linearized sequence of request batches. Note that because the authentication stage acts as a proxy for clients that actually issue requests and does not provide any crossclient coordination, we view it as acting as multiple proposers—one proposer per client3 . Thus, the normal-operation consensus problem corresponds to the standard configuration with multiple proposers and multiple learners. During checkpoint-operation, the execution stage proposes execution checkpoints to the order stage, the order stage accepts and stores the checkpoint, and the execution replicas may (or may not) learn the checkpoint. Note that the execution stage acts as a single proposer while individual execution replicas learn the checkpoint. The rest of this section details our design for the replicated order stage. Section 6.2.1 describes the Zyzzyvark protocol, a PBFT-like [18] consensus protocol, used for normal-operation. Section 6.2.2 describes our approach to piggybacking checkpoint-operation onto Zyzzyvark’s internal checkpointing mechanisms. Section 6.2.3 describes how the inter-stage messages sent to and from the order stage fit into our design and how replicating the order stage impacts the behavior of the authentication and execution stages. Section 6.2.4 describes how the replicated order stage fulfills the properties of a correct order stage described in Chapter 5. 6.2.1 Normal-operation—Zyzzyvark Normal-operation maps to a standard configuration of consensus with multiple proposers and multiple acceptors that is comparable to the consensus problem solved by the PBFT lineage of RSM protocols [18, 24, 26, 50, 49, 92, 104, 107]. We could, in principle, use a protocol like PBFT [18], Zyzzyva [49], or Aardvark [24] as the basis for the order stage. We instead rely on a new replication protocol called Zyzzyvark. We do not introduce any fundamentally new ideas or insights in the design of 3 Semantically, that the authentication stage asserts “Client c said X”, not “Client c said X before client c′ said Y.” 113 Zyzzyvark. We instead combine key ideas from previous protocols to get a simple and robust protocol design. Zyzzyvark, like its predecessors, is based on three subprotocols: agreement, checkpointing, and view change. The agreement subprotocol is used to batch and order requests. One replica is designated the primary of the current view and is responsible for leading the replicas through a standard three phase commit protocol to agree on the order and contents of request batches. The checkpoint subprotocol is used to coordinate checkpoints across replicas and allow the garbage collection of old batches and checkpoints. The view change protocol is used to replace the current primary and transition to a new view led by a new primary. As part of transitioning to a new view v, the view change protocol must ensure that the starting state for view v reflects all batches ordered in previous views v ′ < v. In this section we provide an overview of how the Zyzzyvark protocol works and highlight the ways in which Zyzzyvark differs from previous protocols. 
We refer readers interested in technical proofs and detailed description of agreement and view change protocols to PBFT [18] and Zyzzyva [49]. Replication requirements and quorum size Previous protocols have been designed to be safe and live despite up to f Byzantine failures. We design Zyzzyvark to be safe despite up to r commission failures and live despite up to u Byzantine failures. Translating a protocol described in the language of traditional Byzantine fault tolerance to the language of UpRight fault tolerance is relatively straightforward, requiring only the relabeling of quorum sizes in the system. We identify three distinct quorum sizes as small, medium, and large quorums. In most systems, the protocols are described with the explicit assumption that the minimum 3f + 1 replicas are used. In that context, small quorums have size f + 1 and correspond to the smallest quorum guaranteed to contain at least one correct replica; medium quorums have size 2f + 1 and correspond to the largest quorum that a replica can wait for without endangering liveness; large quorums have size n and contain every replica. Translating these quorums definitions to UpRight is straightforward: a small quorum has size r + 1, a medium quorum has size n − u, and a large quorum has size n. 114 Agreement We begin by describing the PBFT-like agreement protocol that is the core of Zyzzyvark. The protocol begins when the authentication stage, on behalf of an authorized client c, sends an auth-req message to the designated primary order replica. The primary adds the authenticated request contained in the auth-req message to a batch; if the batch is “full” or sufficient time has passed since the last batch was formed, the primary sends a pre-prepare message containing the finalized batch to the other replicas. Each replica verifies that the batch is well-formed—that is that the batch identifier is the next in sequence, that the time associated with the batch is larger than the time associated with the previous batch, and that all requests in the batch are (a) issued on behalf of an authorized client, (b) the next in sequence for that client, and (c) have not been placed in a previous batch). If the batch is well-formed then the replica sends a prepare message to the other replicas. Upon receipt of a medium quorum of matching prepare messages each replica sends a commit message to the other replicas. Upon receipt of a medium quorum of matching commit messages each order replica sends a next-batch message to the execution stage, notifying the execution stage that the batch has been ordered. The execution stage accepts the batch as ordered when it receives a small quorum of matching notifications. This basic communication pattern is employed by PBFT [17] and is shown in Figure 6.1. A straightforward optimization of the basic pattern described above is tentative agreement [18]. Under tentative agreement, the replicas send a tentative batch (tent-batch) to the execution stage after receiving the quorum of prepare messages as shown in Figure 6.2. The execution stage accepts the batch as ordered upon receipt of a medium quorum of matching tent-batch messages. The primary contribution of the Zyzzyva work is speculative ordering [49]. When speculative ordering is employed, replicas notify the execution stage that a batch is speculatively ordered after receiving the pre-prepare message from the primary as shown in Figure 6.3. The execution stage accepts a batch as ordered when it receives a large quorum of spec-batch messages. 
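The three acceptance thresholds used by the execution stage can be summarized in a small Python sketch; the function names are illustrative only.

def ordered_batch_quorums(n, u, r):
    """Quorum sizes at which the execution stage accepts a batch as ordered,
    for an order stage of n replicas that must stay up despite u total
    failures and right despite r commission failures."""
    return {
        "spec-batch": n,       # speculative: every order replica must agree
        "tent-batch": n - u,   # tentative: a medium quorum suffices
        "comp-batch": r + 1,   # complete (committed): a small quorum suffices
    }

def batch_is_ordered(matching_msgs, kind, n, u, r):
    """True once enough matching next-batch messages of the given kind arrive."""
    return len(matching_msgs) >= ordered_batch_quorums(n, u, r)[kind]

# For the minimal order stage of n = 2u + r + 1 replicas with u = r = 1,
# speculative ordering needs 4 matching messages, tentative 3, complete 2.
assert ordered_batch_quorums(4, 1, 1) == {
    "spec-batch": 4, "tent-batch": 3, "comp-batch": 2}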
Enabling speculative ordering requires the order replicas to agree not just on the contents of the next batch, but also on the history of batches that have been ordered—replicas only accept a pre-prepare message if (a) the batch is well formed, (b) the batch is specified as the next batch in the sequence, and (c) the history H contained in the pre-prepare message summarizes the sequence of batches that the replica has observed.

Figure 6.1: Basic communication pattern for complete agreement (a valid request from the authentication stage; pre-prepare, prepare, and commit among the order replicas; the execution stage acts on r + 1 matching complete ordered-batch messages).

The Zyzzyvark protocol makes use of both speculative and traditional ordering. By default, the protocol relies on speculative ordering, and replicas do not exchange prepare or commit messages. The primary can, however, initiate the traditional three-phase order protocol at any time. This may be appropriate and/or necessary if the primary believes that another replica is faulty, or if one replica has requested a view change but the other replicas have not yet joined the insurrection.

Note that one benefit of including the current batch history with every batch is that committing a batch no with history H (i.e., gathering a quorum of commit messages for it) implies that all batches n′o < no whose histories are prefixes of H are also committed. This observation extends to the execution stage's processing of next-batch messages. Recall that the execution stage waits for n speculative next-batch messages, n − u tentative next-batch messages, or r + 1 committed next-batch messages for each batch, and that each next-batch message contains a batch and the history up to that batch. When the execution stage receives sufficient next-batch messages to confirm batch no with history H, it implicitly commits batch no − 1 with history H′ provided that (a) H′ is the immediate prefix of history H and (b) the execution stage has received at least one next-batch message (complete, tentative, or speculative) for batch no − 1 with history H′.

Figure 6.2: Basic communication pattern for tentative agreement (pre-prepare and prepare only; the execution stage acts on n − u matching tentative ordered-batch messages).

Figure 6.3: Basic communication pattern for speculative agreement (pre-prepare only; the execution stage acts on n matching speculative ordered-batch messages).

Further refining failure counts. Several authors have noted [32, 56, 68, 92] that it is possible to provide speculative ordering even when failures occur. These systems introduce a new qualification to the UpRight goals: up despite at most u Byzantine failures, right despite at most r commission failures, and fast despite at most e Byzantine failures. We do not explore the specifics of fast ordering but observe that this work is complementary and can be incorporated into the order stage. Note that specifying fast failures exposes the true size of large quorums as n − e and that a minimum of max{2e + u + 2r + 1, 2u + r + 1} acceptors is always sufficient for fast consensus4. Note that the protocols sketched above implicitly have e = 0.

Message authentication. We rely on MACs to authenticate all messages exchanged as part of the Zyzzyvark protocol.

Faulty client requests.
Note that Zyzzyvark neither relies on signatures for client request authentication (Section 3.3) nor requires special handling for inconsistently authenticated client requests (Section ??). We rely on the one-step transferable property of requests authenticated by the authentication stage to preemptively solve the problem. Checkpoint management As discussed in Section 5.3, the order stage maintains a base checkpoint, a secondary checkpoint, and a log of between CP interval and 2 × CP interval batches ordered since the base checkpoint. The discussion in Section 5.3 focused on the definition of the order checkpoints and stage-level maintenance. In this section, we focus on how the order replicas coordinate to ensure that they each maintain a consistent order 4 The familiar caveat that there exists specific configurations that require fewer acceptors applies. 118 checkpoint. Note that the checkpoint management discussed here is distinct from the checkpoint-operation to be discussed in Section 6.2.2 Before we get into the details of how Zyzzyvark replicas coordinate on orderstage checkpoints, it is important to note that checkpoint generation and garbage collection is a standard part of previous replication libraries such as PBFT [18], Zyzzyva [49], and Aardvark [24]. The checkpoint coordination in Zyzzyvark differs from its predecessors in two important ways. First, Zyzzyvark checkpointing (and by extension the order-stage checkpoints) are comparatively conservative: previous protocols ensure that each replica has one or two checkpoints and a log of at most 2 × CP interval batches since the oldest checkpoint, while Zyzzyvark guarantees that each replica always maintains two checkpoints and a log of between CP interval and 2×CP interval requests since the oldest checkpoint. Second, previous systems rely on a distinct protocol for checkpoint coordination while Zyzzyvark piggybacks checkpoint coordination onto normal operation. Zyzzyvark piggybacks checkpoint coordination onto the agreement protocol that the system runs during normal-operation. The primary augments the preprepare message for batch (no + 1) mod CP interval = 0 with the order-stage checkpoint for no − CP interval and the replicas perform the traditional three phase agreement on this batch. The batch no is not ordered, i.e. the pre-prepare containing no is neither sent by the primary nor processed by a replica, until the replica gathers a medium quorum of commit messages for no − 1. Once no − 1 is committed, a replica can safely garbage collect checkpoint no − 2 × CP interval and all batches n′o < no − CP interval . At the same time, the order replica generates a new checkpoint before considering the pre-prepare message for batch no . View change Zyzzvyark, like PBFT and Zyzzyva, operates in “views.” During a view v, replica v mod |replicaCount| is the designated primary. The view-change protocol is used to elect a new primary and determine the starting state for the next view v +1; in order for the system to remain consistent the new view must reflect all batches that were successfully ordered in the previous view. The Zyzzyvark view-change protocol uses standard techniques developed in PBFT [18] and Zyzzyva [49]. Zyzzyvark adopts the adaptive view-change triggers discussed in Chapter 3. 
Specifically, a replica 119 initiates a view-change when (a) the throughput in the current view drops below a constantly increasing threshold, (b) too much time passes between pre-prepare messages, (c) the primary commits a detectable commission failure (e.g., attempts to include an invalid batch in a pre-prepare message), or (d) a small quorum of other replicas initiate a view change. 6.2.2 Checkpoint-operation Checkpoint-operation refers to the transfer of execution-stage checkpoints to and from the order stage and is conceptually distinct from the internal Zyzzyvark checkpointing discussed in the previous section. Checkpoint-operation is conceptually simple: the execution stage proposes an execution-stage checkpoint to the order stage, the order stage accepts the checkpoint, and individual execution replicas learn the agreed upon checkpoint. Rather than implement another consensus protocol with the order replicas, we map checkpointoperation onto the existing Zyzzyvark internal checkpoint mechanism. We piggyback the checkpoint consensus protocol onto the Zyzzyvark checkpointing mechanism described above. The execution stage proposes an executionstage checkpoint by sending a cp-token message containing the execution checkpoint to each order replica. The order replicas add the execution-stage checkpoint to the corresponding order-stage checkpoint. At designated points in the sequence of ordered batches, the primary includes the checkpoint in the pre-prepare message and the checkpoint is subsequently agreed upon by the order replicas using the full three phase agreement path only if the checkpoint contained in the pre-prepare message matches the checkpoint stored at each non-faulty order replica. Execution replicas can subsequently learn an execution-stage checkpoint after receiving a small quorum of r + 1 matching load-cp messages from the order stage. Note that because the execution stage consists of execution replicas (i.e. the proposer is the learners), the execution replicas generally only learn the value explicitly when recovering from a transient crash or catching up following an asynchronous interval. The communication pattern for the checkpoint consensus protocol is shown in Figure 6.4. 120 Execution Checkpoint Preprepare Prepare Commit Execution Checkpoint Primary Replica Order Stage Replica Replica Replica r+1 Execution Stage Replica Figure 6.4: Basic communication pattern for the order stage checkpoint consensus protocol. Note that while the execution stage acts as a single proposer, each individual replica is a distinct learner. In the context of the UpRight library, learning is done only when a network or node failure occurs. 121 Message auth-req spec-batch tent-batch comp-batch cp-token load-cp last-exec retransmit Consensus Instance normal Consensus Semantics propose request normal learn ordered batch checkpoint checkpoint both both propose checkpoint learn checkpoint utility — missed learning utility — should have learned Table 6.2: Consensus semantics for messages related to the order stage. Each proposal or learn message is part of a single consensus instance. The utility messages are used by both consensus protocols. 6.2.3 Interactions with other stages Replicating the order stage impacts how the authentication and execution stages process messages from the order stage and how they send messages to the order stage. To understand these changes, we must first put the intra-stage messages in the context of the normal and checkpoint consensus protocols. 
There are a total of three intra-stage messages sent to the order stage and three-intra stage messages sent by the order stage. In Table 6.2 we divide these messages into three categories based on which consensus protocol the message is related to: normal-operation, checkpoint-operation, or both. The auth-req messages sent by the authentication stage are the proposals for normal-operation and the next-batch messages sent to the execution stage are the corresponding learning messages. The cp-token messages sent by the execution stage are the proposals for checkpoint-operation and the load load-cp messages sent to specific execution replicas are the corresponding learning messages. The retransmit messages sent to the execution stage and last-exec messages sent by individual execution replicas are utility messages indicating that something should have been learned or something was not learned respectively. The utility messages are used to ensure that an asynchronous network does not prevent the learners from learning accepted values. The first time that the authentication stage authenticates client request nc from client c, it sends an auth-req message containing request nc to the current primary. On subsequent retransmissions of the request, the authentication stage sends the auth-req message to every order replica. Note that the first send results 122 in the request being ordered by the order stage (unless the primary is faulty or the network is ill-behaved) while subsequent sends trigger the retransmission process, notifying the execution stage that it (a) should retransmit any cached response for client c and/or (b) has missed one or more batches. There are three distinct types of next-batch messages sent to the execution replicas: speculative, tentative, and complete. An execution replica learns that a batch has been ordered only after gathering an appropriately-sized quorum of next-batch messages: a large quorum of n speculative next-batch messages, a medium quorum of n − u tentative next-batch messages, or a small quorum of r + 1 complete next-batch messages. Execution replicas act on an ordered batch only once the batch has been learned, i.e., after receiving the appropriately sized quorum of matching next-batch messages. The execution stage sends cp-token messages to every order replica. Each order replica independently places the contained execution checkpoint in its local order checkpoint before participating in Zyzzyvark’s checkpointing protocol. An execution replica learns that an execution checkpoint should be loaded when it receives a small quorum of r + 1 load-cp messages. The small quorum is sufficient because checkpoint-operation relies on the three-phase-commit of the Zyzzyvark internal checkpoint mechanism. The retransmit message is a hint that the order stage may have made accepted values that an execution replica has not yet learned. An execution replica acts on a retransmit message once it has received a small quorum of r + 1 retransmission messages: enough to ensure that at least one correct order replica believes some action by the execution replica is necessary. An execution replica sends last-exec messages to every order replica. The last-exec message explicitly states the last thing the sending replica learned and induces the order stage to resend the appropriate next-batch and load-cp messages to the execution replica. 6.2.4 Order stage properties We identified a set of properties to be maintained by the order stage in Chapter 5. 
Before discussing how the replicated order stage fulfills those properties, we must first adjust the properties to account for replication. The replicated order stage is 123 correct if it is safe despite up to r commission failures and live despite up to u total failures. This results in a pair of simple modifications to the safety and liveness properties: the prefix “if there are at most r commission failures, then” is added to the safety properties and the prefix “if there are at most u total failures and” is added to the liveness properties. We additionally further qualify OL1 to include the qualification “sufficiently often during a sufficiently long synchronous interval.” This additional qualification is made necessary by two properties of the Zyzzyvark protocol. First, for a batch to be ordered, Zyzzyvark requires coordination between multiple order replicas. The requisite communication is only guaranteed to happen during sufficiently long synchronous intervals. Second, Zyzzyvark relies on a primary to place requests in batches and propose an order for the batches. A faulty primary can fail to place specific requests into batches or fail to order requests entirely. The view-change protocol ensures that every primary is eventually replaced, guaranteeing that every request received infinitely often by the order stage during a synchronous interval is eventually received by a non-faulty primary and processed appropriately. The augmented safety and liveness properties are presented below. Note that the augmentations are distinguished through italics. OS1 If there are at most r commission failures, then only fetchable client requests authenticated by the authentication stage are placed into batches, and request nc issued by client c is placed in at most one batch. OS2 If there are at most r commission failures, then batches contain one or more requests and are assigned monotonically increasing batch identifiers no starting with 1 and increasing by 1 with each subsequent batch. For batches no and n′o with associated times t and t′ , no > n′o → t > t′ . OS3 If there are at most r commission failures and request nc > 1 issued by client c is in batch no , then request nc − 1 issued by client c is in batch n′o < no . OS4 If there are at most r commission failures, then the stage always has stable checkpoint at no , where no %CP interval = 0, and CP interval ≤ i ≤ 2 × CP interval subsequent ordered batches. OL1 If there are at most u total failures and the order stage receives, sufficiently often during a sufficiently long synchronous interval, unordered authenticated 124 request nc issued by correct client c, then the order stage places the request in batch no and eventually sends a next-batch message containing no to the execution stage. OL2 If there are at most u total failures and the order stage receives an authenticated request nc from client c that is already in batch no , then it instructs the execution stage to retransmit a response to request n′c from client c in batch n′o where n′c ≥ nc and n′o ≥ no . OL3 If there are at most u total failures and (i) the execution stage requests all batches after ne , (ii) the order stage has ordered batches through no > ne , and (iii) ne + 1 ≥ nCP , then the order stage resends all ordered batches from ne through no . 
OL4 If there are at most u total failures and the execution stage requests all batches after ne and the order stage has ordered batches through no > ne and ne + 1 < nCP , then the order stage instructs the execution stage to load execution checkpoint nCP . The final point for consideration is how the replication strategy discussed in this section fulfills these properties. The safety properties OS1-4 describe the inter- nal invariants maintained by Zyzzyvark and previous protocols such as PBFT [18], HQ [26], Zyzzyva [49], Aardvark [24], and others. Liveness property OL1 the basic liveness property of all asynchronous consensus protocols, and describes OL2-4 de- scribe internal messages used as part of ensuring that every value proposed by a correct proposer is eventually learned by all correct learners. Note that the “sufficiently often” condition of OL1 is satisfied through an interaction between correct clients and the authentication stage. Correct clients retransmit requests according to a regular schedule (at most four seconds between retransmissions) until a response to that request is received, and the authentication stage ensures that a request is retransmitted to the order stage at most once per four seconds. During synchronous intervals, the order stage receives an authenticated client request nc issued by correct client c every four seconds until it is ordered. 125 6.3 Replicated execution stage The primary responsibilities of the execution stage are delivering ordered batches to the application in the specified order and relaying the results of the executed requests to the clients. As part of processing each ordered batch, the execution stage notifies the authentication stage of the requests contained in that batch. Additionally, the execution stage sends an execution-stage checkpoint to the order stage every CP interval batches. We view all of these activities as part of a single consensus protocol. In this consensus protocol, the order stage acts as the single always correct proposer by proposing the sequence of ordered batches. The execution replicas accept the sequence of ordered batches and process batches in order. The clients, authentication stage, and order stage subsequently learn something—clients learn the results of executing batches in the specified ordered, the authentication stage learns which batch contains individual requests, and the order stage learns the executionstage checkpoint. Note that each class of learners explicitly learns a subset of the information accepted by the acceptors; the portions of accepted state not learned explicitly are learned implicitly. The handling of execution-stage checkpoints—specifically how the order stage learns the checkpoints—is the primary design decision that must be addressed when replicating the execution stage. We rely on indirect learning of the checkpoints, but note that other designs are possible. The rest of this section details the design and replication requirements of the replicated execution stage. Section 6.3.1 describes the consensus protocol implemented by the execution stage in more detail. Section 6.3.2 describes alternate design options for handling execution-stage checkpoints. Section 6.3.3 describes the impact that relying on a replicated execution stage has on the authentication stage, the order stage, and clients. Section 6.3.4 describes how the replicated execution stage fulfills the properties of a correct execution stage described in Chapter 5. 
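As a preview of the quorum arithmetic developed in Section 6.3.1, the following Python sketch (illustrative only) shows the matching-message thresholds used by each class of learner in the execution consensus.

def execution_learner_quorums(u, r):
    """Matching-message thresholds for learners of the execution consensus.

    Clients and the authentication stage learn values directly, so r + 1
    matching messages suffice; the order stage learns checkpoint hashes
    indirectly, so it needs max(u, r) + 1 matching cp-token messages."""
    return {
        "client reply": r + 1,
        "batch-complete (authentication)": r + 1,
        "cp-token (order)": max(u, r) + 1,
    }

# With u = 2 and r = 1, the execution stage needs u + max(u, r) + 1 = 5
# replicas; clients act on 2 matching replies, while the order stage waits
# for 3 matching checkpoint tokens.
assert execution_learner_quorums(2, 1) == {
    "client reply": 2,
    "batch-complete (authentication)": 2,
    "cp-token (order)": 3,
}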
6.3.1 Execution consensus

The consensus protocol implemented by the execution replicas is very simple and does not require any intra-stage communication because the order stage acts as a single always-correct proposer. Consensus with a single always-correct proposer follows the communication pattern shown in Figure 6.5: the order stage proposes a batch of requests, the execution stage accepts the batch, and each class of learner learns its portion of the outcome—the clients learn the results of executing the batch, the authentication stage learns which batch each request is placed in, and the order stage learns an execution checkpoint.

Figure 6.5: Execution consensus (an ordered batch enters the execution stage; clients and the authentication stage learn from r + 1 matching reply and batch-completed messages, while the order stage learns the execution checkpoint from max{u, r} + 1 matching messages).

As mentioned in Section 6.1, asynchronous consensus with a single correct proposer requires at least u + r + 1 replicas [56]. In this environment, learners can learn when they receive a quorum of r + 1 matching messages from the acceptors, unless indirect learning is necessary, in which case a quorum of max{u, r} + 1 matching messages is required. The consensus protocol implemented by the execution replicas provides both regular learning (to the authentication stage and clients) and indirect learning (to the order stage). Note that unlike the replicated order stage, the replicated execution stage does not require any coordination among its internal replicas to implement consensus. The basic consensus protocol shown in Figure 6.5 can consequently be implemented by a set of execution replicas running the execution stage pseudo-code described in Section 5.7.4 without modification.

6.3.2 Execution-stage checkpoints

Recall that execution replicas send a hash of the execution-stage checkpoint to the order stage and not the checkpoint itself. When an execution replica falls behind or suffers a transient crash, it learns the hash of the appropriate checkpoint to load and must subsequently fetch the checkpoint from another execution replica. Consequently, the order stage learns checkpoint hashes that are both correct and fetchable. This corresponds to indirect learning as discussed in Section 6.1 and requires at least u + max{u, r} + 1 execution replicas.

While the consensus protocol itself does not require any coordination between execution replicas, allowing individual execution replicas to fetch execution-stage checkpoints from another replica does require additional coordination.

Replica coordination

The only interaction required between execution replicas occurs when one replica falls far enough behind the other replicas that it must load a checkpoint that is not present locally. Figure 6.6 contains execution replica pseudo-code that handles the exchange of state between replicas. The additional messages introduced are shown in Table 6.3; full byte specifications of these messages can be found in Appendix A.3.

Recall from Section 5.3.2 that the execution stage loads a checkpoint upon receipt of a load-cp message from the order stage. For replicas that have the specified checkpoint in their local storage (i.e., because they are recovering from a transient crash), loading the checkpoint is simple. However, it is also possible for a replica to receive the load-cp message and not have the execution-stage checkpoint in local storage (e.g., because the replica became disconnected or suffered a transient crash and the other replicas made progress in its absence).
When this occurs, the replica must first fetch the execution checkpoint described by the token Tcp contained in the load checkpoint message by sending a fetch-exec-cp message to other execution replicas. Another execution replica responds with an exec-cp-state message containing the checkpoint state; the fetching replica compares the state to the checkpoint token contained in the load-cp message and loads the state only if it is valid. As part of loading the execution checkpoint, the replica instructs the local copy of the application to load the application checkpoint contained in the execution checkpoint. If the application has the requisite state, then the checkpoint is loaded and operation can continue. It is likely, however, that the application may not have all of the requisite state available locally. If this is the case, the replica fetches the missing state by exchanging fetch-state and state messages with other execution replicas. The application can provide the full checkpoint to the execution stage or a token that describes the checkpoint concisely. If the former option is chosen, then the execution replicas never fetch application state using the latter two messages in Table 6.3. If the latter option is chosen, then those two messages may be used to retrieve relevant state from other execution replicas.

on rcv ⟨load-cp, Tcp, no, o⟩µo,e :
    if ∃ exec_CP.ne and hash(exec_CP.ne) = Tcp then
        CPexec := exec_CP.ne
        state := app.loadCP(CPexec.getAppCP())
        if loadCP fails because application is missing state Tstate then
            send ⟨fetch-state, Tstate, e⟩µ~e,E to E
    else
        send ⟨fetch-exec-cp, n, e⟩µ~e,E to E

on rcv ⟨fetch-exec-cp, n, e⟩µ~e,E :
    if checkpoint Tcp is locally available then
        send ⟨exec-cp-state, n, S, this.e⟩µthis.e,e

on rcv ⟨exec-cp-state, n, S, e⟩µe,this.e :
    if requested checkpoint n and S matches Tcp then
        CPexec := exec_CP.ne
        state := app.loadCP(CPexec.getAppCP())
        if loadCP fails because application is missing state Tstate then
            send ⟨fetch-state, Tstate, e⟩µ~e,E to E

on rcv m = ⟨fetch-state, Tstate, e⟩µ~e,E :
    S := app.getState(m.Tstate)
    send ⟨state, m.Tstate, S, this.e⟩µthis.e,m.e to m.e

on rcv m = ⟨state, Tstate, S, e⟩µe,this.e :
    state := app.loadState(m.S, m.Tstate)

Figure 6.6: Execution replica pseudo-code related to intra-stage checkpoint and state transfer.

Message          Semantic meaning
fetch-exec-cp    Fetch execution checkpoint ne
exec-cp-state    Contains the state S of execution checkpoint ne
fetch-state      Fetch application state described by Tstate
state            Contains application state S described by Tstate

Table 6.3: State management messages exchanged between execution replicas.

Checkpoint alternatives

Note that we require u + max{u, r} + 1 execution replicas rather than u + r + 1 because execution checkpoints are learned indirectly. Specifically, we require the execution stage to send a token, or cryptographic hash, describing the checkpoint to the order stage. Because we store the hash of the checkpoint at the order stage and the checkpoint at individual execution replicas, the order stage must be sure both that the checkpoint hash is correct and that the checkpoint is stored by at least one correct execution replica so that it can be fetched by another replica as needed.

There are two natural questions to ask. First, can we simplify the max{u, r} part of that expression?
There are two natural questions to ask. First, can we simplify the max{u, r} part of that expression? Second, can we avoid sending the execution-stage checkpoint (or its hash) to the order stage? At a high level, the answer to both questions is "no." We can, in theory, simplify the max{u, r} portion of the expression to r by storing the entire execution-stage checkpoint (and not a hash) at the order stage. We reject this approach because it can dramatically increase the network requirements of the system. Similarly, we can remove the execution-stage checkpoint from the order stage entirely by increasing the number of execution replicas or by relying on digital signatures to authenticate checkpoints. We reject these approaches for two reasons. First, to ensure that the order stage does not outrun the execution stage (i.e., order several checkpoint intervals worth of batches that are not delivered to the execution stage because of a lossy network), the execution stage must notify the order stage when it has completed a checkpoint. Second, augmenting that checkpoint notification to include a hash of the execution-stage checkpoint is less expensive than (a) authenticating execution-stage checkpoints with digital signatures or (b) increasing the number of execution replicas.

Can we simplify max{u, r}? It is straightforward to reduce the required number of execution replicas to r + u + 1 by storing execution checkpoints at the order stage rather than tokens that describe the checkpoints. If this approach is taken, the order stage need only affirm that the checkpoint was correctly generated (i.e., receive at least r + 1 matching checkpoint messages) and does not need to ascertain that the checkpoint will be fetchable by another execution replica. The repercussions of storing the full execution checkpoint at the order stage are very application and deployment dependent. In deployments where there are few clients and the application checkpoints are very small, the execution checkpoints will be small and inexpensive to transfer to and store at the order stage. On the other hand, if there are large numbers of clients or the application checkpoints are large (gigabytes or even terabytes), then the costs of transferring the checkpoint from the execution replicas to the order stage and maintaining that checkpoint within the order stage may become prohibitive. We choose to err on the side of conserving network bandwidth and simplifying the order stage and consequently store checkpoint hashes rather than full checkpoints at the order stage.

Can we avoid sending the execution-stage checkpoint (or its hash) to the order stage? Previous work on separating order from execution by Yin et al. [107] is based on a protocol where the order stage is oblivious to checkpoints generated by the execution stage and requires u + r + 1 execution replicas⁵. Lamport [60] presents a similar architecture that requires u + 1 execution replicas for a CFT system. We could adopt a similar approach and not store any reference to execution-stage checkpoints at the order stage. Doing so would, however, require us either to use digital signatures⁶ to authenticate execution-stage checkpoints or to rely on u + 2r + 1 execution replicas to provide one-step transferability within the execution stage. We believe it is better to store the execution-stage checkpoint hash at the order stage than to introduce digital signatures or increase the number of execution replicas.

⁵ Note that the work was presented as requiring 2f + 1 execution replicas, where f = u = r.
⁶ The non-repudiation provided by digital signatures is equivalent to ∞-step transferability. Any authentication scheme that provides non-repudiation suffices.
In order to understand why digital signatures are necessary if there are only u + max{u, r} + 1 execution replicas, let us consider a deployment where u = r = 1 and there are 3 execution replicas. Suppose replica a is correct but does not receive any messages because of a lossy network. Meanwhile, the other two replicas, b and c, process ordered batches from the order stage. Replica b is in fact Byzantine, but follows the protocol faithfully and generates correct client responses. After several checkpoint intervals, the network failure is repaired and replica a begins receiving messages again. At this point, a discovers that it is very far behind its peers and requests the most recent checkpoint from both b and c. Replica c responds with the correct checkpoint while replica b responds with a different checkpoint. Replica a is potentially in the unfortunate position of not being able to differentiate the correct checkpoint from a faulty checkpoint.

We could avoid this problem by having the replicas agree on the checkpoint and a proof that the checkpoint is correct. While this hypothetical proof would certainly ensure that only correct checkpoints are loaded, implementing the proof requires digital signatures (or another authentication scheme that provides non-repudiation) to ensure that the correct checkpoint can be identified and loaded. Using digital signatures, a replica could gather a proof by waiting for digital signatures that match its checkpoint from r other replicas; this would require a total of at least u + r + 1 replicas.

Note that even with digital signatures, u + max{u, r} + 1 execution replicas are required; u + r + 1 replicas do not suffice. Consider a setting where u > r = 0 and there are u + 1 ≥ 2 total execution replicas. Assume, for the moment, that u replicas are caught behind a network partition, resulting in only one execution replica processing batches from the order stage. The execution stage is guaranteed to be live despite up to u failures, so the system is able to continue processing requests as long as the clients continue to provide them. Note that the disconnected replicas are not actually faulty, but are prevented from receiving messages by an asynchronous and lossy network. Now suppose that the single active replica suffers a permanent crash and that the network failure is simultaneously repaired, but only after the system has processed several checkpoint intervals worth of requests. When the formerly disconnected replicas begin receiving messages again, they are unable to process new batches because their local state is not current, they do not possess a recent checkpoint because of garbage collection, and they are unable to fetch a recent checkpoint because the only replica that had the checkpoint has now failed. The net result is that the system cannot make safe progress despite the fact that no replica is guilty of a commission failure and only 1 ≤ u replicas have failed.
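As an aside, the transferable proof discussed above is just a quorum of matching signatures over a checkpoint hash. A minimal sketch of checking such a proof follows (our illustration only; it is not part of the UpRight code base, and the class name and key-distribution details are hypothetical):

import java.security.PublicKey;
import java.security.Signature;
import java.util.Map;

// Illustrative only: a checkpoint proof is valid if at least r+1 distinct
// replicas have signed the same checkpoint hash, so at least one correct
// replica vouches for it.
final class CheckpointProofChecker {
    private final Map<Integer, PublicKey> replicaKeys;  // replica id -> public key
    private final int r;                                 // commission failures tolerated

    CheckpointProofChecker(Map<Integer, PublicKey> replicaKeys, int r) {
        this.replicaKeys = replicaKeys;
        this.r = r;
    }

    /** signatures maps replica id -> signature over cpHash. */
    boolean isValidProof(byte[] cpHash, Map<Integer, byte[]> signatures) throws Exception {
        int valid = 0;
        for (Map.Entry<Integer, byte[]> entry : signatures.entrySet()) {
            PublicKey key = replicaKeys.get(entry.getKey());
            if (key == null) continue;                   // unknown replica; ignore
            Signature verifier = Signature.getInstance("SHA256withRSA");
            verifier.initVerify(key);
            verifier.update(cpHash);
            if (verifier.verify(entry.getValue())) valid++;
        }
        return valid >= r + 1;
    }
}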
We could replace digital signatures in the previous discussion with matrix signatures [3]. Doing so would trade the expense of digital signatures for additional execution replicas: matrix signatures can be implemented using MACs, but require 3f + 1 = 2u + r + 1 replicas [3], so replacing digital signatures with matrix signatures would require u + max{u, r} + r + 1 execution replicas.

Regardless of which approach we use to remove execution-stage checkpoints from the order stage, the execution stage must notify the order stage when it generates a checkpoint to prevent the order stage from outrunning the execution stage. Given this constraint, and the three options of (a) storing an execution-stage checkpoint token at the order stage, (b) using digital signatures to authenticate execution-stage checkpoints, and (c) increasing the number of execution replicas, we believe that storing an execution-stage checkpoint token at the order stage is the most reasonable decision.

Summary. Table 6.4 shows the tradeoffs for the various checkpointing strategies: (1) the required execution replicas, (2) the network costs, and (3) the computation costs. We consider schemes that rely on digital signatures to have high computation costs and schemes that rely exclusively on MACs to have low computation costs. Schemes that push one or more copies of the execution checkpoint across the network have high network costs, while schemes that exclusively push hashes of the checkpoint have low network costs. We compare four schemes for handling execution checkpoints. In the first scheme, we store a full checkpoint at the order stage. In the second scheme, we store the hash of the checkpoint at the order stage. In the third scheme, we do not store anything related to the checkpoint at the order stage and rely on digital signatures to generate a transferable proof for execution replicas to exchange with a valid checkpoint. In the final scheme, we replace digital signatures with matrix signatures. Storing the hash of the checkpoint at the order stage provides the right practical tradeoff between the required number of execution replicas and total network and computational costs.

Checkpoint strategy                                    Required execution replicas   Network costs   Computation costs
Full CP at order stage                                 u + r + 1                     high            low
Hash of CP at order stage                              u + max{u, r} + 1             low             low
Full CP at execution stage with digital signatures     u + max{u, r} + 1             low             high
Full CP at execution stage with matrix signatures      u + max{u, r} + r + 1         low             low

Table 6.4: Summary of replication requirements for different checkpoint storage strategies.

6.3.3 Interactions with other stages

Replicating the execution stage impacts how the authentication stage, order stage, and clients process messages received from the execution stage and how they send messages to the execution stage. To understand these changes, we divide the inter-stage messages into two categories: consensus messages and state management messages. The consensus messages are the proposal and learning messages from the consensus protocol as well as the utility messages that alert the execution replicas that something should have happened. The state management messages are used to transfer request bodies from the authentication stage to the execution replicas.

Table 6.5 shows the six inter-stage messages that are part of the execution-stage consensus protocol.

Message          Consensus semantics
next-batch       proposal
batch-complete   learn
reply            learn
cp-token         learn
retransmit       utility — learning failed
last-exec        utility — missed proposal

Table 6.5: Inter-stage messages and their role in the execution consensus protocol.

The next-batch message is the proposal and is sent by the order stage to all execution replicas. The batch-complete, reply, and cp-token messages are sent by execution replicas to the authentication stage, the client that issued the request, and the order stage, respectively. Upon receipt of a quorum of n − u matching messages, the recipient can safely learn the contents of the message. The retransmit message is sent by the order stage to every execution replica as a notification that either an accepted value was not learned or a proposed value was not accepted. An individual execution replica sends the last-exec message to the order stage to indicate that the replica did not receive a proposal—the order stage processes last-exec messages on a replica-by-replica basis and does not gather a quorum of matching messages.

Table 6.6 shows the two inter-stage state management messages. These messages are used to transfer request bodies from the authentication stage to individual execution replicas. After receiving an ordered batch, an execution replica sends the fetch message to the authentication stage indicating that the replica needs the specified request body. The authentication stage responds by sending a command message containing the request body to the execution replica that issued the fetch message.

Message   Consensus semantics
fetch     none – state management
command   none – state management

Table 6.6: Inter-stage messages related to state management.
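To make the learning rule for the batch-complete, reply, and cp-token messages concrete, the following sketch (our illustration; the class is hypothetical and not part of the UpRight implementation) counts matching messages and learns a value once n − u replicas agree on it:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a learner for one slot of the execution consensus.
// With n = u + max(u, r) + 1 execution replicas, n - u matching messages
// guarantee that at least one correct replica vouches for the value.
final class QuorumLearner {
    private final int n;   // number of execution replicas
    private final int u;   // total failures tolerated
    private final Map<Integer, byte[]> votes = new HashMap<>();  // replica id -> digest

    QuorumLearner(int n, int u) { this.n = n; this.u = u; }

    /** Returns the learned digest once n - u replicas sent identical digests, else null. */
    byte[] onMessage(int replicaId, byte[] digest) {
        votes.put(replicaId, digest);          // keep only the latest vote per replica
        int matching = 0;
        for (byte[] d : votes.values()) {
            if (Arrays.equals(d, digest)) matching++;
        }
        return (matching >= n - u) ? digest : null;
    }
}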
6.3.4 Execution stage properties

We identified the properties maintained by a correct execution stage in Chapter 5. A replicated execution stage is correct if it maintains the safety properties despite up to r commission failures and the liveness properties despite up to u total failures. Additionally, indirect learning requires ES4 to hold despite up to u total failures. The requisite modifications to the safety and liveness properties are italicized below.

ES1 If there are at most r commission failures, then batch no is only delivered to the application if the last batch delivered to the application is no − 1.

ES2 If there are at most r commission failures, then only ordered batches are delivered to the application.

ES3 If there are at most r commission failures, then only responses generated by the application are cached or sent to clients.

ES4 If there are at most r commission failures and at most u total failures, then the execution stage maintains the execution checkpoint referenced by the order stage's base checkpoint in persistent memory.

ES5 If there are at most r commission failures, then the execution stage provides deterministic and replayable execution of ordered batches.

EL1A If there are at most u total failures and the execution stage receives ordered batch no and the last batch it has delivered to the application is n′o < no, then it fetches the request bodies for requests in batch no from the authentication stage and notifies the authentication stage that the contained requests have been ordered.

EL1B If there are at most u total failures and the execution stage has all of the request bodies for batch no and the last batch it delivered to the application is no − 1, then the execution stage delivers batch no to the application.

EL2 If there are at most u total failures and the execution stage receives a response from the application, then it stores the response for retransmission and sends the response to the responsible client.
EL3 If there are at most u total failures and the execution stage receives a retransmission instruction for request nc from client c in batch no and the last batch executed by the execution stage is ne > no, then the execution stage resends the response to the most recent request n′c ≥ nc executed for client c and notifies the authentication stage that n′c has been ordered no later than batch ne.

EL4 If there are at most u total failures and the execution stage receives a retransmission instruction for request nc from client c in batch no and the last batch executed by the execution stage is ne < no, then the execution stage informs the order stage that it has missed the batches since ne.

EL5 If there are at most u total failures and the execution stage receives an instruction to load checkpoint ne from the order stage, then it loads execution checkpoint ne.

Each correct execution replica independently implements the safety and liveness properties above. Coordinating the replicas through the consensus protocol as discussed in Section 6.3.1 ensures that a collection of at least max{u, r} + u + 1 execution replicas is sufficient to implement a correct execution stage.

6.4 Replicating authentication stage

The authentication stage is responsible for authenticating requests issued by authorized clients, caching the bodies of those requests, and delivering a hash of authenticated requests to the order stage. This process maps to a collection of consensus protocols, one per client per request. The consensus protocols share the same set of acceptors (the authentication replicas) and learners (the order stage) and are differentiated by the proposer (each client is a proposer in a distinct instance of consensus each time it issues a distinct request).

The rest of this section details the design of the replicated authentication stage. Section 6.4.1 describes the implementation of each authentication replica and intra-stage coordination. Section 6.4.2 describes the impact of replicating the authentication stage on clients, the order stage, and the authentication stage. Section 6.4.3 describes how the replicated authentication stage fulfills the properties of a correct authentication stage described in Chapter 5.

6.4.1 Authentication consensus

We map the authentication stage to the acceptors in a collection of consensus protocols. Each distinct request issued by a client c is the proposal for a distinct instance of consensus. The authentication replicas accept the request. The order stage learns request hashes that (a) correspond to requests issued by authorized clients, (b) correspond to request bodies that are cached by the authentication stage, and (c) are one-step transferable. As discussed in Section 6.1, a total of u + r + 1 replicas is sufficient to provide basic consensus with a single proposer and satisfy requirement (a). Requirement (b) lays out the need for indirect learning and a baseline of u + max{u, r} + 1 replicas. Requirement (c) requires one-step transferability and increases the requisite number of authentication replicas to the final total of u + max{u, r} + r + 1.

Authentication replicas implement the authentication stage pseudo-code presented in Chapter 5 and do not communicate with each other when processing client requests. The communication induced by this (lack of) coordination is similar to the consensus protocol employed by the execution stage and can be found in Figure 6.7. The authentication stage differs from the execution stage in two important ways.
First, the authentication stage does not require any checkpoints to be coordinated between the authentication replicas because it is acceptable for replica state to diverge. Values learned from the execution stage depend on each other—it is impossible for the execution stage to process batch no without first processing batch n′o < no. Values learned from the authentication stage, on the other hand, are independent of each other—learning that "client c said X" does not require any knowledge that "client c′ said Y." Second, the authentication stage is required to provide one-step transferability of authenticated requests. This requirement is important because the order stage is based on a primary-led consensus protocol, and we rely on MACs for message authentication.

Figure 6.7: Authentication consensus. Clients send client requests to the authentication stage (max{u, r} + r + 1 replicas), which sends authenticated requests to the order stage.

6.4.2 Interactions with other stages

The authentication stage receives three messages from other stages and sends two messages to other stages. The complete set of messages sent to and processed by the authentication stage is shown in Table 6.7.

Message                                                     Sent by
⟨client-req, ⟨req-core, c, nc, op⟩, c⟩µ~c,F                 client
⟨auth-req, ⟨req-core, c, nc, hash(op)⟩µ~f,O, f⟩µ~f,O        authentication stage
⟨batch-complete, v, no, C, e⟩µ~e,F                          execution stage
⟨fetch, no, c, nc, hash(op), e⟩µ~e,F                        execution stage
⟨command, no, c, nc, op, f⟩µf,e                             authentication stage

Table 6.7: Messages sent to and from the authentication stage.

The first time a client c issues request nc, the client optimistically assumes that the network is well-behaved and that there are no failed authentication replicas, and sends a client-req message to a preferred medium quorum of n − u authentication replicas. The preferred quorum used by client c consists of the n − u authentication replicas starting with replica c mod n. If c retransmits the client-req message containing request nc, then it sends the request to all authentication replicas on the assumption that either the network is ill-behaved or one or more replicas in its preferred quorum are in fact faulty.

The order stage primary gathers a medium quorum of max{u, r} + r + 1 auth-req messages before placing a request in a batch. Order replicas, including the primary, gather a small quorum of r + 1 auth-req messages before sending a retransmit message for a previously ordered request.

The execution stage sends batch-complete messages to all authentication replicas. Execution replicas initially send fetch messages to a specific member of each client's designated preferred quorum—authentication replica c mod n. If the execution replica does not receive the request body, then it resends the fetch message to all authentication replicas. An execution replica may act on the first response to its fetch message that it receives from an authentication replica, though it checks the body against the request hash contained in the ordered batch before acting on the body.
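The preferred-quorum rule above is easy to state concretely. The sketch below (our illustration; the class and method names are hypothetical) lists, for a given client identifier, the n − u authentication replicas that make up its preferred quorum:

import java.util.ArrayList;
import java.util.List;

// Illustrative only: the preferred quorum for client c is the n - u
// authentication replicas starting at index c mod n (wrapping around).
final class PreferredQuorum {
    static List<Integer> forClient(int clientId, int n, int u) {
        List<Integer> replicas = new ArrayList<>();
        int start = Math.floorMod(clientId, n);
        for (int i = 0; i < n - u; i++) {
            replicas.add((start + i) % n);
        }
        return replicas;
    }
}

// Example: with u = r = 1 the authentication stage has
// n = u + max{u, r} + r + 1 = 4 replicas, so client 7 prefers replicas {3, 0, 1}.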
6.4.3 Authentication stage properties

We modify the authentication stage properties identified in Chapter 5 to accommodate the UpRight design goals. A replicated authentication stage is correct if it maintains the safety properties despite up to r commission failures and the liveness properties despite up to u total failures. The requisite modifications to the safety and liveness properties are italicized below.

AS1 If there are at most r commission failures, then only requests issued by authorized clients are authenticated and every authenticated request is one-step transferable.

AS2 If there are at most r commission failures and at most u total failures, then every authenticated request referenced by a batch ordered since the base checkpoint at the order stage, or not yet ordered, is fetchable.

AL1a If there are at most u total failures and the authentication stage receives a request nc issued by correct client c and there is no pending request n′′c < nc, then request nc is authenticated and sent to the order stage.

AL1b If there are at most u total failures and the authentication stage receives a request nc issued by correct client c and there is a pending request n′c, then request n′c is authenticated and sent to the order stage.

AL2 If there are at most u total failures and the authentication stage receives a fetch body message from the execution stage for an authenticated request nc issued by client c, then the authentication stage responds with the request body.

The safety properties AS3 and AS4 are maintained by individual authentication replicas and are not properties maintained by the authentication stage as a whole. Note that in the context of the end-to-end system, AS3 and AS4 are not strictly necessary; these two properties are used to limit the rate at which faulty clients can force the system to consume storage and bandwidth.

AS3 At most one request per identifier nc per authorized client c is authenticated.

AS4 When request nc from client c is authenticated, no request n′c > nc has been authenticated and there is no pending request n′c < nc.

Each authentication replica implements the protocol described in Figure 5.7, and correct replicas maintain local versions of the authentication stage safety and liveness properties (intuitively, replace "authentication stage" with "authentication replica" and ignore the failure count qualifier). The union of replicas that individually provide the specified properties ensures that the stage as a whole provides the properties.

6.5 Implementation and performance

We implement the UpRight library, based on the inter-stage protocol described in Chapter 5 and the replicated stages previously discussed in this chapter, in Java, and we regrettably must name the prototype JS-Zyzzyvark⁷; J-Zyzzyvark refers to a configuration where we omit writing to disk in order to compare more meaningfully with prior Byzantine agreement protocols and to expose bottlenecks in our protocol. We believe that a Java-based solution is more suitable for widespread deployment with the Java-based ZooKeeper and HDFS systems than a C implementation, despite the difference in performance between C and Java implementations. We also note that logging actions to disk places a ceiling on throughput, so the benefits of further optimization may be limited.

We run our servers on 3GHz dual-core Pentium-IV machines, each running Linux 2.6 and Sun's Java 1.6 JVM. We use the FlexiProvider [36] cryptographic libraries for MACs and digital signatures and the Netty [74] networking library for asynchronous Java I/O. Nodes have 2GB of memory and are connected via a 100Mbit/s Ethernet. Except where noted, we use separate machines for authentication, order, and execution replicas.

⁷ "J" because the prototype is implemented in Java, "S" because the prototype stores state to stable storage, and "Zyzzyvark" because the prototype is based on the Zyzzyvark protocol.
Figure 6.8: Latency versus throughput for J-Zyzzyvark and JS-Zyzzyvark (1B and 1KB requests).

The UpRight library (client, authentication, order, and execution stages) comprises 20,403 lines of code (LOC).

Method. Our basic experimental setup involves correct clients that operate in a closed loop—that is, they issue requests one at a time and do not issue request i until they receive a response to request i − 1. Unless otherwise noted, correct clients issue 100k requests. We increase system load by increasing the number of clients. Clients record the time at which each request is issued and the time at which the response is received. We calculate the average latency of all requests issued by all clients. We calculate per-second throughput by dividing the total number of requests issued by all clients by the total duration of the experiment in seconds. Each data point corresponds to a single experimental run.

Response time and throughput. Figure 6.8 shows the throughput and response time of J-Zyzzyvark and JS-Zyzzyvark. We vary the number of clients issuing 1-byte or 1KB null requests that produce 1-byte or 1KB responses and drive the system to saturation. We configure the system to tolerate one fault (u = r = 1).

Figure 6.9: Latency versus throughput for JS-Zyzzyvark configured for various values of r and u.

For small requests, J-Zyzzyvark's and JS-Zyzzyvark's peak throughputs are a respectable 5.5 and 5.1 Kops/second, which suffices for our applications. They are comparable to unmodified ZooKeeper's peak throughput for small read/write requests, and they appear sufficient to support an HDFS installation with a few thousand active clients. Peak throughputs fall to 4.5 and 4.2 Kops/second for a workload with larger 1KB requests and 1KB replies.

For comparison, in Chapter 3 we reported small-request throughputs of 7.6, 23.8, 38.6, 61.7, and 66.0 Kops/s for the C/C++-based HQ [26], Q/U [1], Aardvark [24], PBFT [18], and Zyzzyva [49] on the same hardware. For environments where performance is more important than portability or easy packaging with existing Java code bases, we believe a well-tuned C implementation of Zyzzyvark with writes to stable storage omitted would have throughput between that of Aardvark and Zyzzyva—our request validation and agreement protocols are cheaper than Aardvark's, but our request validation is more expensive than Zyzzyva's.

Figure 6.10: Latency versus throughput for JS-Zyzzyvark configured for various values of r and u with authentication, order, and execution replicas colocated.

Figure 6.11: Jiffies per request for 1 to 256 clients, comparing colocated and separate deployments. RQ indicates the jiffies at the authentication stage; Order indicates the jiffies at the order stage; Execution indicates the jiffies at the execution stage.

Other configurations. Figure 6.9 shows small-request performance as we vary u and r.
Recall that Zyzzyvark requires 2u + r + 1 authentication and order replicas and u + max{u, r} + 1 execution replicas to ensure that it can tolerate u failures and remain up and r failures and remain right. Peak throughput is 11.1 Kops/second when JS-Zyzzyvark is configured with u = 1 and r = 0 to tolerate a single omission failure (e.g., one crashed replica), and throughput falls as the number of faults tolerated increases. For reference, we include the u = 0, r = 0 line, for which the system has just one authentication, order, and execution replica and cannot tolerate any faults; peak throughput exceeds 22 Kops/s, at which point we are limited by the load that our clients can generate.

Figure 6.10 shows small-request performance when the authentication, order, and execution replicas are colocated on 2u + r + 1 total machines. Splitting phases across machines improves peak throughput by factors of 1.67 to 1.04 over such colocation when any fault tolerance is enabled, with the difference falling as the degree of fault tolerance increases.

Figure 6.11 shows the number of CPU jiffies (4ms of CPU time) per request, summed across authentication, order, and execution processes on all replicas, for two configurations: (1) when all stages share a common set of machines and (2) when each stage runs on its own separate set of machines. As load increases, larger batch sizes amortize some costs, reducing processing per request. In the second configuration, the bottleneck is the order stage, and the execution replicas are lightly utilized. The higher per-request processing cost that we observe in the first configuration is unexpected, and we have not to date identified a convincing explanation for it.

Request authentication. In Figure 6.12 we examine the throughput of the JS-Zyzzyvark prototype configured for u = 1 and r = 1 and using different strategies for client request authentication. The MAC RQ line shows performance of the default JS-Zyzzyvark configuration that relies on MAC-based matrix signatures formed at the authentication stage. In contrast, the SIG no RQ line omits the authentication stage entirely and shows the significant performance penalty imposed by relying on traditional digital signatures for request authentication, as in Aardvark. The MAC no RQ line shows the performance that is possible in a system that relies, like PBFT, on MAC authenticators and uses no authentication stage for client authentication.

Figure 6.12: JS-Zyzzyvark performance when using the authentication replicas and matrix signatures, standard signatures, and MAC authenticators (1B requests).

In a system where the robustness risk and corner-case complexity of relying on MAC authenticators as opposed to matrix signatures are viewed as acceptable, this configuration may be attractive. For comparison, the no auth RQ line shows performance when we use the authentication stage but turn off calculation and verification of MACs, and the no auth no RQ line shows performance when we eliminate the authentication stage and also turn off calculation and verification of MACs.

Request digests. Figure 6.13 demonstrates the value of storing requests at the authentication stage so that the order stage can operate on digests rather than full requests. We configure the system for u = 1 and r = 1.
For small requests (under 64 bytes in our prototype), the authentication stage sends full requests and the order replicas operate on full requests; the figure's 1B Request line shows performance for 1-byte requests. The 1KB Digest and 10KB Digest lines show performance for 1KB and 10KB requests when authentication replicas store requests and send request digests for ordering, and the 1KB Request and 10KB Request lines show performance with request storage and digests turned off so that order replicas operate on full requests. Storing requests at the authentication stage more than doubles peak throughput for 1KB and 10KB requests.

Figure 6.13: JS-Zyzzyvark performance for 1B, 1KB, and 10KB requests, and for 1KB and 10KB requests where full requests, rather than digests, are routed through order replicas.

6.6 Discussion

Separating the stages in the UpRight architecture facilitates a clean and modular design and implementation. The separation is a logical separation only, and there is no fundamental reason not to colocate an authentication, order, and execution replica on the same machine. When this is done, the replicas communicate as if they are all on distinct machines⁸.

It is tempting to bind colocated replicas to each other more tightly in order to eliminate the inter-stage communication steps, especially the all-to-one authentication-to-order and all-to-all order-to-execution steps. This temptation is misguided, however, as these communication steps are intrinsic to important properties of our design. The authentication-to-order step allows us to order request hashes rather than full request bodies and to avoid the dangers of inconsistently authenticated client requests without relying on digital signatures, while the order-to-execution step allows us to ensure that no replica is ever required to roll back application state.

The authentication stage is responsible for authenticating requests issued by clients and caching the bodies of those requests until they are ordered. The former responsibility simplifies the protocol for agreeing on the order of batches by ensuring that any request deemed valid by a correct primary will also be deemed valid by a correct replica (discussed in Section 3.3 and Section 6.2), while the latter responsibility reduces the network bandwidth required when agreeing on the batch execution order. Removing the all-to-one communication between the authentication stage and the order stage primary effectively eliminates the authentication stage from the architecture and would require the agreement protocol to order client requests, increasing order-stage bandwidth and storage costs, and either to use digital signatures to authenticate client requests or to implement the code to handle the corner case where a client request is inconsistently authenticated (or suffer the consequences observed in Section 3.6.2).

The all-to-all communication between the order and execution stages ensures that the execution replicas only execute a batch of requests when that batch is definitely the next batch in the sequence. One important side effect of this design is that the execution replicas are never required to roll application state back.

⁸ An obvious optimization in that case is to send messages to colocated replicas using memory channels rather than relying on the network infrastructure.
While rolling application state back is feasible, the standard mechanism for doing so is to load an old checkpoint and then replay the requests since that checkpoint. This can be an expensive activity and is one that should be avoided. In order to maintain this property of no application rollback with the order and execution stages merged, the replicas could not execute batches until after the prepare phase completed—merging the order and execution stages would not reduce the required number of message delays and could actually increase the total network traffic, since there are always at least as many order replicas as execution replicas. Of course, if an application can support fine-grained checkpoints and rollback, then it may be tenable to allow the execution replicas to execute batches speculatively, rather than rely on speculation simply to speed the learning process.

6.7 Conclusion

This chapter describes the replicated implementation of the authentication, order, and execution stages specified in Chapter 5. The key to understanding the design of each stage is understanding the problem of consensus and the impact that the context in which consensus is being solved has on the number of required replicas. The authentication, order, and execution stages each implement the acceptors for one or more instances of consensus. Once those instances are identified, the design of each stage is straightforward. Understanding the mapping between each stage and consensus allows us to implement each stage with a minimal number of replicas and also to identify and fix problems with previous systems that attempt to separate the stages of state machine replication.

In the context of this chapter, UpRight fault tolerance is a fact of the implementation and design. The next chapter describes our experience incorporating the UpRight library into HDFS and ZooKeeper. The value of the flexibility of UpRight fault tolerance will be discussed in that context.

Chapter 7

UpRight Applications

The previous chapters have focused on the challenges of implementing the library specification provided in Chapter 4. This chapter, in contrast, focuses on our experiences adapting the ZooKeeper distributed coordination service [108] and the Hadoop distributed file system (HDFS) [43] to be compatible with the UpRight library. Our goal in this chapter is to use these two systems as case studies to demonstrate three points:

1. The application changes required to make existing applications UpRight are small in scope and complexity.

2. UpRight applications provide flexible fault tolerance—a single code base can provide different levels of crash and Byzantine fault tolerance through simple modifications to a configuration file.

3. The performance of UpRight applications is competitive with the performance of the unmodified code bases.

While the second point can be objectively demonstrated, the first and third are largely subjective and are demonstrated through observation and experience reports. While the experiences reported in this chapter are specific to our two case studies, we believe the lessons learned are applicable to other applications for which UpRight is likely to be of interest.

Recall that we specified a set of properties to be maintained by the application in Chapter 4. Intuitive statements of these properties are shown in Table 7.1 for easy reference.

APPS1   Only requests contained in batches received from the library are executed.
APPS2   Requests are executed deterministically.
APPS3   Checkpoints are generated deterministically.
APPS4   Loading a checkpoint puts the application into the state it was in when the checkpoint was generated.
APPL1   The function call execute(batch) returns.
APPL2   The function call takeCP() returns.
APPL3   The function call loadCP() returns.

Table 7.1: Informal statement of application requirements.
The primary challenge we face in this chapter is adapting ZooKeeper and HDFS to meet these requirements, namely to provide deterministic execution (APPS2) and on-demand deterministic checkpoint generation (APPS3, APPS4). The UpRight library, because it is implemented in Java, implicitly requires applications to be written in Java, though there is no fundamental reason that the library cannot be ported to support applications written in other programming languages.

Application requirement APPS1 requires the application to process only valid requests, that is, requests delivered to the application by the replication library, and is trivial to maintain. Application requirement APPS2 requires the application to execute batches deterministically—given an application state and a batch of requests, the application should always produce the same set of responses and end in the same application state. We employ standard techniques for ensuring deterministic request execution [1, 18, 24, 26, 49, 50, 86, 92, 104, 107] when modifying ZooKeeper and HDFS to fulfill this requirement.

Application requirements APPS3 and APPS4 require the application to produce deterministic checkpoints on demand. As discussed at the end of Section 5.3.2, generating application checkpoints on demand plays an important role in bounding the state stored by the UpRight library and bringing "slow" replicas up to speed. We suspect that applications for which replication is appropriate already rely on some form of checkpointing to handle power outages and other transient failures; our experiences with ZooKeeper and HDFS reinforce this belief. However, our experiences indicate two challenges to providing deterministic on-demand checkpoints as required by the UpRight library. First, the checkpoints are not always deterministic
The primary drawback to this approach is that it can require significant parts of the application to be rewritten. In contrast, the requirement that the application implement its own checkpointing actually facilitates reuse of existing functionality. We approach the challenges of modifying an application using the framework shown in Figure 7.1. From the application developer’s perspective, the UpRight library is a black box. The application developer’s sole responsibility is attaching the application to the library—the application server must attach to the executionstage and each application client attaches to a distinct library client instance. We conceptually divide the execution stage and client into three distinct components to facilitate this process: 1. A generic shim: the shim moderates communication between the stages in the UpRight architecture. The shim implements the execution stage of the UpRight architecture and exports a simple API to the application. 2. Application-specific glue: The application-specific glue is the bridge between the UpRight library and the application. it is the one part of the system where knowledge of both how the application works and awareness of repli152 Shim Glue Shim Glue Shim Glue Glue Shim Application Client Application Server Application Server Client Application Server Execution Replicas UpRight Library Figure 7.1: UpRight application architecture from an application developer perspective. The UpRight library is a black box with a well defined interface. At both the client and the server, the developer implements application-specific glue that connects the library shim to the original application. cation are mixed. The glue contains the application-specific knowledge necessary for replication: demuxing request batches, maintaining and constructing application-specific instantaneous checkpoints, identifying which state must be transferred in order to load a checkpoint, etc. At the client stage the glue performs appropriate request pre-processing and response post-processing. 3. The application: the application is the (mostly) unmodified application. The application is responsible for providing deterministic execution and deterministic checkpoints. Our goal is to keep changes to the application to a minimum and isolate the application awareness of replication in the glue. The rest of this chapter contains six sections. Section 7.1 provides an overview of what is needed to provide deterministic request execution that satisfy APPS2 , and APPL1 , and APPL3 , . Section 7.2 describes a generic checkpoint management scheme that we adapt for use with both ZooKeeper and HDFS to provide APPL2 APPS1 APPS3 , APPS4 , . Section 7.3 describes specifics of our experience with HDFS and reports on observed performance. Section 7.4 describes specifics of our experience 153 with ZooKeeper and reports on observed performance. Section 7.5 summarizes our experiences with ZooKeeper and HDFS and highlights the key lessons learned. The Java APIs for the client and server shims and glues can be found in Appendix B. 7.1 Request Processing We identified three main challenges in ensuring that batch execution meet the requirements of ing APPS2 APPS1 and APPS2 . APPS1 , unsurprisingly, was straightforward. Enforc- required us to (a) demux batches of requests into individual requests, (b) handle sources of nondeterminism including PRNG seeds and system time, and (c) address challenges associated with multi-threading. 
We report on the techniques we found sufficient in our work with HDFS and ZooKeeper. Note that the experiences we report here and in the rest of this chapter are pragmatic responses to the challenges that we encountered and are not driven by first principles. Nonetheless, we believe that these challenges (and our solutions) are likely to be relevant to many replicated applications. Our core strategy to providing deterministic execution is ensuring that our applications execute requests deterministically and sequentially based on the order they appear in the batch. Ensuring deterministic execution is a well explored research area [48, 73]. Our approach to this problem is guided by the goal of making minimal changes to the application and the run-time, rather than identifying a principled and potentially invasive approach that can automatically be applied to an arbitrary application. Demuxing batches. The UpRight library provides the application with batches of requests. The glue/application must demux the batches into individual requests for execution. We take the simple approach of interpreting the batch sequentially— the first request is executed first, the second is executed second, and so forth. Nondeterminism. Many applications rely on real time or random numbers as part of normal operation. These factors can be used in many ways including garbage collecting soft state, naming new data structures, or declaring uncommunicative nodes dead. Each request issued by the UpRight shim to the application server glue 154 is accompanied by a time and random seed to be used in conjunction with executing the request [18]. UpRight applications must be modified to rely on these specified times rather than the local machine time and to use the random seed as appropriate when using a pseudo random number generator. Multithreading. Parallel execution allows applications to take advantage of hard- ware resources, but application servers must ensure that the actual execution is equivalent to executing the request batches sequentially in the order specified by the UpRight library. The simplest way to enforce this requirement is for the glue to complete execution of batch no before the execution of batch no + 1 and request i of batch no before beginning execution of request i + 1. Although we take the simple approach of executing batches and requests sequentially, more sophisticated glue may process the requests of an individual batch in parallel [50, 100] or may even support parallel execution of batches as long as all replicas generate the same output from a set of ordered batches. Some systems include “housekeeping” threads that asynchronously modify application server state. For example, an HDFS server maintains a list of live data servers, removing an uncommunicative server from the list after a timeout. An application must ensure that housekeeping threads run at well-defined points in the sequence of requests by, for example, scheduling such threads at specific points in virtual time rather than at periodic real time intervals. 7.2 Checkpoint Generation In an asynchronous system, even correct server replicas can fall arbitrarily behind, so state machine replication frameworks must provide a way to checkpoint a server replica’s state, to certify that a quorum of server replicas have produced identical checkpoints, and to transfer a certified checkpoint to a node that has fallen behind [18]. 
Recall from the discussions in Chapters 4 and 6 that the UpRight library periodically tells the server application to checkpoint its state to persistent memory and asks for a cryptographic hash that uniquely identifies that stable state. Further, if a replica falls behind, the library (i.e., the server shim at that replica) communicates with the other server shims to retrieve the most recent checkpoint, restarts the server application using that state, and finally replays the log of ordered requests 155 after that checkpoint to bring the replica to the current state. Given this context for how and when application checkpoints are generated and applied, there are several pragmatic concerns to consider. Application checkpoints must be (1) inexpensive to generate because the replication framework requests checkpoints at a high frequency, (2) inexpensive to apply because the replication framework uses checkpoints in both the rare case of a machine crashing and restarting and the more common case of a machine falling behind on message processing, (3) deterministic because correct nodes must generate identical checkpoints for a given request sequence number, and (4) nonintrusive on the codebase because we must not require extensive modifications of applications. There is tension among these requirements. For example, generating checkpoints more frequently increases generation cost but reduces recovery time (because the log that must be applied will be correspondingly shorter.) For example, requiring an application to store its data structures in a memory array checksummed with a Merkle tree [18] can reduce checkpoint generation and fetch time (since only changed parts need be stored or fetched) but may require intrusive changes to legacy applications. We resolve this tension through a generic checkpoint glue library that implement a checkpoint/delta approach and relies on a helper process for deterministic checkpoint generation. The checkpoint/delta approach allows the generic checkpoint glue to provide the UpRight library with the required frequent checkpoints while only rarely paying the high cost of generate the native application checkpoints. We use the generic checkpoint glue with both HDFS and ZooKeeper. The generic checkpoint glue is suitable for use with other applications, though an application specific glue can implement a different checkpoint strategy [18, 104] if needed. Checkpoint/delta approach. The checkpoint/delta approach seeks to minimize intrusiveness to legacy code by reusing existing application functionality and interposing a small amount of batch logging. We posit that most crash fault tolerant services will already have some means to checkpoint their state. So, to minimize intrusiveness, to lower barriers to adoption, and to avoid the need for projects to maintain two distinct checkpoint mechanisms, we wish to use applications’ existing checkpoint mechanisms. Unfortunately, the existing application code for generating checkpoints is likely to be suitable for 156 infrequent, coarse grained checkpoints. For example, both the HDFS and Zookeeper applications produce their checkpoints by walking their important in-memory data structures and writing their contents to persistent memory. The checkpoint/delta approach uses existing application code to take checkpoints at the approximately the same coarse-grained intervals the original systems use. We presume that these intervals are sufficiently long that the overhead is acceptable. 
Figure 7.2 presents the checkpoint/delta approach graphically. A naive implementation of the checkpoint/delta approach produces checkpoints as shown in Figure 7.2. Specifically, each time a coarse-grained checkpoint is produced, that checkpoint is returned to the library. This naive approach has two fundamental limitations. First, it can introduce periodic latency spikes into the system if generating the coarse-grained checkpoint is a very expensive operation. Second, if an execution replica begins loading a checkpoint/delta around the same time that a coarse-grained checkpoint is produced, the replica is likely to fetch both the new and old coarse-grained checkpoints.

We avoid these two issues by structuring the checkpoints produced by the checkpoint/delta approach in a fashion similar to the checkpoint and batch logs maintained by the order stage (Section 5.3.1): a checkpoint consists of one coarse-grained checkpoint and sufficient deltas to reach the next coarse-grained checkpoint, but not enough to reach the subsequent coarse-grained checkpoint, as shown in Figure 7.3. This increases the time budget available to the application to produce the coarse-grained checkpoint before it is needed by the system. It also makes it possible for a recovering replica to load a single coarse-grained checkpoint and as many logs as necessary to catch up with the rest of the system, even if multiple coarse-grained checkpoints are generated while the recovery takes place.

Within the checkpoint/delta approach, the application's checkpoints must be produced deterministically. We overview several approaches below: helper processes, stop and copy, OS fork, and application copy-on-write. We use the helper process approach in our HDFS and ZooKeeper prototypes.

Figure 7.2: The checkpoint/delta approach for managing application checkpoints. Original application checkpoints are taken infrequently, but the library requests a checkpoint every 100 batches. (a) shows the original application checkpoint taken after executing batch n. (b) shows the checkpoint returned to the replication library after executing batch n + 100. This checkpoint consists of the application checkpoint at n and the log of the next 100 batches. (c) shows the checkpoint returned to the replication library after executing batch n + 200. (d) shows the checkpoint returned to the replication library after executing batch n + 400.

Figure 7.3: Checkpoint-deltas returned to the library. Each returned checkpoint-delta consists of a coarse-grained application checkpoint and sufficient deltas to produce the next coarse-grained checkpoint.

Helper process. The helper process approach produces checkpoints asynchronously to avoid pausing request execution and seeks to minimize intrusiveness to legacy code.
To ensure that different replicas produce identical checkpoints without having to pause request processing, each node runs two slightly modified instances of the server application process—a primary and a helper—to which we feed the same series of requests. We deactivate the checkpoint generation code at the primary. For the helper, we omit sending replies to clients, and we pause the sequence of incoming requests so that it is quiescent while it is producing a checkpoint. The helper process approach requires us to run two copies of the application at each replica. Surprisingly, our experiences with ZooKeeper and HDFS indicate that the overheads of this approach are not unmanageable.

Stop and copy. If an application's state is small and the application can tolerate a few tens of milliseconds of added latency, the simplest checkpoint strategy is to pause the arrival of new requests so that the application is quiescent while it writes its state to disk. Since we eliminate other sources of nondeterminism as described above, this approach suffices to ensure that replicas produce identical checkpoints for a given sequence number. Unfortunately, stop and copy is not suitable for applications that either (a) have a large amount of application state or (b) are not compatible with periodic latency spikes.

OS fork. Operating systems provide a fork() call that can be used to make an instantaneous copy of a process. One approach is to use fork() to create a copy of the application and then generate the checkpoint from the copy before destroying the auxiliary process. Unfortunately, on most operating systems fork() does not interact properly with the JVM, and it is not uncommon to see the child process crash due to unfortunately timed garbage collection or some other background process.

Application copy-on-write. Rather than use a helper process to produce a deterministic checkpoint, applications can be modified so that their key data structures are treated as copy-on-write while checkpoints are taken [19, 18, 86]. This approach can have lower performance overheads, but it can require extensive application modification to support.

7.3 HDFS case study

The Hadoop Distributed File System (HDFS) [43] is an open-source cluster file system modeled loosely on the Google File System [39]. It provides parallel, high-throughput access to large, write-once, read-mostly files. An HDFS deployment comprises a single NameNode and many DataNodes. Files are broken into large (default 64MB) blocks, and by default each block is stored on three DataNodes. The NameNode keeps the file-name-to-block-ID mappings and caches the block-ID-to-DataNode mappings reported by DataNodes as soft state. We overview the interactions between NameNodes, DataNodes, and clients in Section 7.3.1.

UpRight-HDFS enhances HDFS by (1) eliminating a single point of failure and improving availability by supporting redundant NameNodes with automatic failover and (2) providing end-to-end Byzantine fault tolerance against faulty clients, DataNodes, and NameNodes.

7.3.1 Baseline system

In this section we overview the basic operation of HDFS.
7.3 HDFS case study

The Hadoop Distributed File System (HDFS) [43] is an open-source cluster file system modeled loosely on the Google File System [39]. It provides parallel, high-throughput access to large, write-once, read-mostly files. An HDFS deployment comprises a single NameNode and many DataNodes. Files are broken into large (default 64MB) blocks, and by default each block is stored on three DataNodes. The NameNode keeps the file name to block ID mappings and caches the block ID to DataNode mappings reported by DataNodes as soft state. We overview the interactions between NameNodes, DataNodes, and clients in Section 7.3.1. UpRight-HDFS enhances HDFS by (1) eliminating a single point of failure and improving availability by supporting redundant NameNodes with automatic failover and (2) providing end-to-end Byzantine fault tolerance against faulty clients, DataNodes, and NameNodes.

7.3.1 Baseline system

In this section we overview the basic operation of HDFS. To write a new block, a client requests a new block ID from the NameNode, and the NameNode selects a block ID and a list of DataNodes. The client then sends a write comprising the block ID, the data, a list of 4-byte CRC32 checksums for each 512 bytes of data, and the list of DataNodes to the nearest listed DataNode; that DataNode stores the data and checksums, forwards the write to the next DataNode on the list, and reports the completed write to the NameNode. After the DataNodes acknowledge the write, the client sends a write complete request to the NameNode; the write complete request returns once the NameNode knows that the data has reached the required number of DataNodes. To read a block, a client requests a list of the block’s DataNodes from the NameNode, sends the read request to a DataNode, and gets the data and checksums in reply. DataNodes send periodic heartbeats to the NameNode. After a number of missed heartbeats, the NameNode declares the DataNode dead and replicates the failed node’s blocks from the remaining copies to other DataNodes.

The NameNode checkpoints its state to a file with the help of a Secondary NameNode. The NameNode writes all transactions to a series of log files. Periodically, the Secondary fetches the most recent log file and the current checkpoint file. The Secondary then loads the checkpoint, replays the log, and writes a new checkpoint file. Finally, the Secondary sends the new checkpoint file back to the NameNode, and the NameNode can reclaim the corresponding log file. If a NameNode crashes and recovers, it first loads the checkpoint and then replays the log.

The fault tolerance of the baseline HDFS system is not cleanly categorizable as “crash” or “Byzantine.” The checksums at the DataNodes protect against some but not all Byzantine failures. For example, if a DataNode suffers a fault that corrupts a disk block but not the corresponding checksum, then a client would detect the error and reject the data; but if a faulty DataNode returns the wrong block and also returns the checksum for that wrong block, a client would accept the wrong result as correct. In its default configuration, HDFS can ensure access to all data even if two DataNodes fail by omission, and it can ensure that it returns correct data for some but not all commission failures of up to two DataNodes. We will summarize HDFS DataNodes’ fault tolerance as u = 2 r = 0/2. HDFS’s Secondary NameNode’s role is just to compact the log into the checkpoint file, and there is no provision for automatically transferring control from the NameNode to the Secondary NameNode. If the NameNode suffers a catastrophic failure, one could imagine manually reconfiguring the system to run the NameNode on what had been the Secondary’s hardware, but recent updates could be lost. An HDFS NameNode’s fault tolerance is u = 0 r = 0.
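The write path described at the beginning of this subsection can be summarized in a short sketch. The interfaces below are hypothetical stand-ins for illustration only; they are not the actual HDFS client API.

import java.util.List;
import java.util.zip.CRC32;

// Illustrative sketch of the baseline HDFS block write path; types are hypothetical stand-ins.
interface DataNodeStub {
    void write(long blockId, byte[] data, long[] checksums, List<DataNodeStub> pipeline);
}
interface NameNodeStub {
    long allocateBlockId();
    List<DataNodeStub> chooseDataNodes();
    void writeComplete(long blockId); // returns once enough DataNodes hold the block
}

class HdfsWriteSketch {
    static void writeBlock(NameNodeStub nn, byte[] data) {
        long blockId = nn.allocateBlockId();              // NameNode picks a block ID
        List<DataNodeStub> nodes = nn.chooseDataNodes();  // and a list of DataNodes
        long[] checksums = crcPer512Bytes(data);          // 4-byte CRC32 per 512 bytes of data
        // The client sends everything to the nearest DataNode, which stores the data and
        // checksums, forwards the write down the list, and reports completion to the NameNode.
        nodes.get(0).write(blockId, data, checksums, nodes);
        nn.writeComplete(blockId);                        // write complete request to the NameNode
    }

    static long[] crcPer512Bytes(byte[] data) {
        int chunks = (data.length + 511) / 512;
        long[] sums = new long[chunks];
        for (int i = 0; i < chunks; i++) {
            CRC32 crc = new CRC32();
            crc.update(data, i * 512, Math.min(512, data.length - i * 512));
            sums[i] = crc.getValue();
        }
        return sums;
    }
}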
7.3.2 UpRight-HDFS

Given the UpRight framework, adding Byzantine fault tolerance to HDFS is straightforward.

UpRight-NameNode

Adapting the HDFS NameNode to work with UpRight requires modifications to less than 1750 lines of code. The bulk of these changes, almost 1600 lines, relates to checkpoint management and generation. In particular, we add about 730 lines to include additional state in checkpoints. For example, we include mappings from block IDs to DataNodes in a NameNode’s checkpoints—although we still treat these mappings as soft state that expires when a DataNode is silent for too long, including this state in the checkpoint ensures that NameNode replicas processing a request agree on whether the state has expired or not. In addition, we add about 830 lines to modify the logs to record every operation that modifies any NameNode state rather than only the modifications to the file ID to block ID mapping.

The other major change needed to make the HDFS NameNode compatible with UpRight is removing sources of nondeterminism from its request execution path. These changes affect under 150 lines and fall into three categories. We replace 5 references to local system time with references to the time provided by the order nodes for the current batch of requests. Similarly, we modify 20 calls to random() so that they are all seeded by the agreed-upon order time. The final step in removing nondeterminism is disabling the threads responsible for running a variety of periodic background jobs based on System.time() and instead executing those tasks based on the time specified by the order nodes.
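The time and randomness substitutions described above follow a simple pattern, sketched below. The class and method names are hypothetical; the essential point is that every replica draws time and random values from the ordered batch rather than from its local environment.

import java.util.Random;

// Sketch of substituting agreed-upon values for local time and randomness; names are illustrative.
class DeterministicEnvironment {
    private long orderedTime;      // time agreed upon by the order nodes for the current batch
    private Random orderedRandom;  // PRNG seeded from the agreed-upon order time

    void startBatch(long timeFromOrderNodes) {
        orderedTime = timeFromOrderNodes;
        orderedRandom = new Random(timeFromOrderNodes);
    }

    long currentTimeMillis() { return orderedTime; }             // replaces reads of local system time
    int nextRandom()         { return orderedRandom.nextInt(); } // replaces calls to random()

    // Periodic background jobs fire when the ordered time passes their deadline,
    // rather than from timer threads driven by the local clock.
    boolean backgroundJobDue(long deadline) { return orderedTime >= deadline; }
}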
Clients. The modified HDFS NameNode corresponds to the application server in the UpRight library deployment. When deploying the service, we treat both HDFS clients and HDFS DataNodes as application clients. Reads and writes issued by HDFS clients are processed as client requests in the UpRight library. Similarly, DataNode heartbeats and notifications that a write has completed are processed as client requests in the UpRight library.

UpRight-DataNode

We originally imagined that we would replicate each DataNode as a BFT state machine and reduce the application-level data replication in light of the redundancy in the BFT DataNode “supernodes.” Although academically pure, simply using a black-box state machine replication library to construct BFT DataNodes would have changed the replication policies of the system in significant and perhaps undesirable ways. For example, HDFS’s default data placement policy is to store the first copy on a node in the same rack as the writer, the second copy on a node in another rack, the third copy on a different node in the same rack as the second, and additional copies on randomly selected, distinct nodes. Further, if a DataNode fails and is replaced, HDFS ends up spreading the recovery cost approximately evenly across the remaining DataNodes. Additionally, if a new DataNode is added, the system gradually makes use of it. Although one could imagine approximating some of these policies within a state machine replication approach, we instead leave the (presumably) carefully considered HDFS DataNode replication policies in place (i.e., 3-way replication). These policies ensure that block writes complete if at most u = 0 of the selected DataNodes are faulty and reads complete if at most u = 2 of the selected DataNodes are faulty. Our modifications further ensure that reads only return correct values, i.e., r = 3.

To that end, our UpRight-DataNode makes a few simple changes to the existing DataNode. The main changes are to (1) add a cryptographic subblock hash on each 64KB subblock of each 64MB (by default) block and a cryptographic block hash across all of a block’s subblock hashes and (2) store each block hash at the NameNode. In particular, DataNodes compute and store subblock and block hashes on the writes they receive, and they report these block hashes to the NameNode when they complete the writes. A client includes the block hash in its write complete request to the NameNode, and the NameNode commits a write only if the client and a sufficient number of DataNodes report the same block hash. As in the existing code, clients retry on timeout, the NameNode eventually aborts writes that fail to complete, and the NameNode eventually garbage collects DataNode blocks that are not included in a committed write. To read a block, a client fetches the block hash and list of DataNodes from the NameNode, fetches the subblock hashes from a DataNode, checks the subblock hashes against the block hash, fetches subblocks from a DataNode, and finally checks the subblocks against the subblock hashes; the client retries using a different DataNode if there is an error. These changes require us to change or add 189 LOC at the client, 519 lines at the DataNode, and 238 lines at the NameNode. Finally, we add the expected MACs and MAC authenticators to all messages with the exception of subblock hash and subblock data read replies from DataNodes to clients, which are directly or indirectly checked against the block hash from the NameNode.
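The subblock/block hash computation described above can be sketched in a few lines. The digest algorithm and class name below are illustrative assumptions (MD5 is used only because the library elsewhere uses MD5 digests); the structure—one hash per 64KB subblock, plus a block hash over the concatenated subblock hashes—is what matters.

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative sketch of the UpRight-DataNode hash hierarchy; not the prototype's code.
class BlockHashSketch {
    static final int SUBBLOCK = 64 * 1024;

    static byte[][] subblockHashes(byte[] block) throws NoSuchAlgorithmException {
        int n = (block.length + SUBBLOCK - 1) / SUBBLOCK;
        byte[][] hashes = new byte[n][];
        for (int i = 0; i < n; i++) {
            MessageDigest md = MessageDigest.getInstance("MD5");
            md.update(block, i * SUBBLOCK, Math.min(SUBBLOCK, block.length - i * SUBBLOCK));
            hashes[i] = md.digest();
        }
        return hashes;
    }

    // The block hash covers all of the block's subblock hashes; it is what the client reports
    // in its write complete request and what the NameNode stores and checks.
    static byte[] blockHash(byte[][] subblockHashes) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        for (byte[] h : subblockHashes) md.update(h);
        return md.digest();
    }
}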
Programmer background. The modifications to HDFS were performed by a junior graduate student (Sangmin Lee) with minimal knowledge of the internals of the UpRight library. Development took a total of approximately three months; most of that time was spent learning how the internals of the HDFS codebase work.

7.3.3 Evaluation

In this section we compare UpRight-HDFS with the original. Unless otherwise noted, experiments run on subsets of 107 Amazon EC2 small instances [6]. In each experiment, we have 50 DataNodes and 50 clients, and each client reads or writes a series of 1GB files. For both systems, we replicate each block to 3 DataNodes, giving u = 2, r = 2/0 for HDFS and u = 2 r = 2 for UpRight. HDFS’s NameNode is a single point of failure (u = r = 0). For the UpRight-HDFS runs, we configure the NameNodes for u = r = 1 and co-locate the RQ and order nodes. To evaluate UpRight’s ability to support CFT configurations, we also look at a u = 1 r = 0 configuration.

Figure 7.4 shows the throughput achieved with 50 clients and DataNodes. For both systems, write throughput is lower than read throughput because each block is written to three disks but read from one. Even with r = 1, UpRight-HDFS’s read performance is approximately equal to that of HDFS because only one DataNode is required to read and send the data. With r = 1, UpRight-HDFS’s write performance is over 70% of HDFS’s; the slowdown on writes appears to be due to added agreement for the replicated NameNode and the overheads of MAC computations for the DataNodes. With r = 0, the MAC computations are omitted and write performance is over 80% of HDFS’s; the compensation for this slowdown is the ability to remain available even if a NameNode crashes. Figure 7.5 shows the CPU consumption for these workloads.

Figure 7.4: Throughput for HDFS and UpRight-HDFS.

Figure 7.5: CPU consumption (jiffies per GB of data read or written) for HDFS and UpRight-HDFS.

When r = 1, UpRight-HDFS’s CPU costs are within a factor of 2.5 of the original for writes and within a factor of two for reads. Note that CPU consumption is one of the worst metrics for UpRight-HDFS; other system resources like the disks and networks have much lower overheads. When r = 0, the overheads are smaller—factors of 1.1 and 1.6 for writes and reads, respectively. We also note that the computational cycles for these workloads are dominated by the work performed at the DataNodes and not the NameNode replicated with the UpRight library.

UpRight-HDFS incurs additional computational overheads and delivers somewhat lower performance than HDFS. These costs come with a benefit, as demonstrated by Figure 7.6. The two graphs plot completion time for requests issued by a single client that issues each request 0.5 seconds after the previous request completes. After 10 seconds of this workload we kill a NameNode and in the process corrupt its checkpoint log. We then restart the NameNode after an additional 5 seconds. Progress with the HDFS NameNode stops at 10 seconds when the log becomes corrupted. When the NameNode restarts 5 seconds later it immediately crashes again after attempting to load the corrupted log. In UpRight-HDFS, the absence of a single NameNode does not prevent progress. Additionally, when the failed NameNode restarts, it fetches a valid state from the other replicas and resumes correct operation rather than attempting to load its corrupted local log.

Figure 7.6: Completion time for requests issued by a single client. In (a), the HDFS NameNode fails and is unable to recover. In (b), a single UpRight-HDFS NameNode fails, and the system continues correctly.

7.3.4 MapReduce

MapReduce is an application frequently run on top of HDFS. In Figure 7.7 we report the execution times of the TeraSort and TeraGen MapReduce workloads. TeraGen generates 100,000,000 random 100-byte entries and TeraSort sorts the generated data. This set of experiments is run on a collection of machines with four-core 2.4GHz processors and 8GB of RAM. There are 20 DataNodes in the experiments, with 20 map tasks for TeraGen and 20 reducers for TeraSort, all running on the DataNodes. HDFS is configured to use 3-way data replication. Our current UpRight implementation allows clients to have at most one request outstanding at any time and uses a single proxy client per machine, regardless of how many tasks are running on that machine. In this experiment, each mapper, reducer, and DataNode process on a single machine shares one UpRight client proxy.

Our results indicate that the UpRight library imposes a modest overhead on overall execution time. We believe this overhead can be reduced by improving the implementation of both the UpRight library and the interactions between the application and the library on the client side. Specifically, we believe that engineering UpRight to support multiple outstanding requests per client or to have a client per task rather than a single client per machine would improve performance.

7.4 ZooKeeper case study

ZooKeeper [108] is an open-source coordination service that, in the spirit of Chubby [12], provides services like consensus, group management, leader election, presence protocols, and consistent storage for small files.
ZooKeeper guards against omission failures. However, because data centers typically run a single instance of a coordination service on which many cluster services depend [19], and because even a small control error can have dramatic effects [97], investing modest additional resources to protect the service against a wider range of faults may be attractive.

Figure 7.7: Execution time for TeraGen and TeraSort MapReduce workloads.

7.4.1 Baseline system

A ZooKeeper deployment comprises 2u + 1 servers; a common configuration is 5 servers for u = 2 r = 0. Servers maintain a set of hierarchically named objects in memory. Writes are serialized via a Paxos-like protocol, and reads are optimized to avoid consensus where possible [18]. A client can set a watch on an object so that it is notified if the object changes, unless the connection from the client to a server breaks, in which case the client is notified that the connection broke. For crash tolerance, each server synchronously logs updates to stable storage. Servers periodically produce fuzzy snapshots to checkpoint their state: a thread walks the server’s data structures and writes them to disk, but requests concurrent with snapshot production may alter these data structures as the snapshot is produced. If a ZooKeeper server starts producing a snapshot after request s_start and finishes producing it after request s_end, the fuzzy snapshot representing the system’s state after request s_end comprises the data structures written to disk plus the log of updates from s_start to s_end.

7.4.2 UpRight-ZooKeeper

UpRight-ZooKeeper is based on ZooKeeper version 3.0.1. Given the UpRight framework, adding Byzantine fault tolerance to ZooKeeper to produce UpRight-ZooKeeper is straightforward. Our shims use standard techniques to add authenticators to messages and to send/receive them to/from the right quorums of nodes. We use the techniques described above to support watches via server push, to make time-based events happen deterministically across replicas at the same virtual time, and to canonicalize read-only replies. ZooKeeper’s fuzzy snapshots align well with our hybrid checkpoint/delta approach; we modify ZooKeeper to make the snapshots deterministic and identical across replicas using the helper process approach.

The original ZooKeeper server comprises 13589 lines of code (LOC). We add or modify 604 lines to integrate it with UpRight. The bulk of these changes involved modifying the checkpoint generation code to include all required state and integrate a helper process for use with the hybrid checkpoint/delta approach (347 LOC), glue code to handle communication between ZooKeeper and the UpRight and checkpoint/delta libraries (129 LOC), and making references to time and randomness deterministic across replicas (66 LOC). We also deactivate or delete some existing code. In particular, we delete 342 LOC that deal with asynchronous IO and multithreading, and we no longer use 5644 LOC that handle ZooKeeper’s original replication protocols.

Programmer background. The modifications to ZooKeeper were performed by a pair of junior graduate students (Manos Kapritsos and Yang Wang) with minimal knowledge of the internals of the UpRight library. Development took a total of approximately three months; most of that time was spent learning how the internals of the ZooKeeper codebase work.
7.4.3 Evaluation

We evaluate ZooKeeper 3.0.1 and UpRight-ZooKeeper running on the hardware described in Section 6.5. For ZooKeeper, we run with the default 5 servers (u = 2 r = 0). We then configure UpRight-ZooKeeper to tolerate as many or more faults. In particular, we examine UpRight-ZooKeeper with u = 2 r = 1 for all phases to minimize the replication cost of adding commission failure tolerance while retaining at least ZooKeeper’s original omission failure tolerance. We also examine a configuration that we refer to as u = 2+ r = 1 that has u = 2 r = 1 for the RQ and order stages and u_exec = 3 r_exec = 1 for the execution stage; this configuration retains ZooKeeper’s default 5 execution replicas. The results presented here rely on the helper process approach for checkpointing. We observe similar performance when using copy-on-write techniques.

Figure 7.8: Throughput for UpRight-ZooKeeper and ZooKeeper for workloads comprising different mixes of 1KB reads and writes.

In addition, we evaluate UpRight-ZooKeeper’s performance in CFT configurations (r = 0) to explore whether UpRight would be suitable for new applications that want to support both CFT and BFT configurations using a single library. We evaluate the performance of UpRight-ZooKeeper with u = 2 r = 0 to match ZooKeeper’s omission tolerance with the minimum degree of replication. We also evaluate a configuration that we refer to as u = 2+ r = 0 that has u = 2 r = 0 for the RQ and order stages and u_exec = 4 r_exec = 0 for the execution stage; this configuration retains ZooKeeper’s default 5 execution replicas.

Figure 7.8 shows throughput for different mixes of 1KB reads and writes. For writes, the systems sustain several thousand requests per second. Nearly a decade of effort to improve various aspects of BFT agreement [1, 18, 24, 26, 49, 50, 92, 100, 104, 107] has paid off: when r = 1, UpRight-ZooKeeper’s write throughput is 77% of ZooKeeper’s for both u = 2 and u = 2+. UpRight also appears to provide competitive write performance for CFT configurations: with r = 0 and either u = 2 or u = 2+, UpRight-ZooKeeper’s throughput is more than 111% of ZooKeeper’s.

For reads that can accept serializability for their consistency semantics, both ZooKeeper and UpRight-ZooKeeper exploit the read-only optimization to skip agreement and issue requests to a quorum of r + 1 execution nodes that have processed the reader’s most recent write. Both systems’ read throughputs are many times their write throughputs, but in configurations where ZooKeeper queries fewer execution nodes or has more total execution nodes, its peak throughput can be proportionally higher. For example, when ZooKeeper sends read requests to 1 server and spreads these requests across 5 execution replicas, we expect to see about 2.5 times the throughput of a configuration where UpRight-ZooKeeper sends read requests to 2 servers (for r = 1) and spreads them across 4 execution replicas. When UpRight-ZooKeeper is configured to tolerate commission failures, it pays additional CPU overheads for cryptographic checksums but saves some network overheads by having only one execution node send a full response and having the others send a hash [18]. Overall, UpRight-ZooKeeper’s serializable read throughput ranges from 17.5 Kops/s to 43.4 Kops/s, which is 34% to 85% of ZooKeeper’s 51.1 Kops/s throughput.
Although reading identical results from a properly chosen quorum of r + 1 servers can guarantee that the read can be sequenced in a global total order, the position in the sequence may not be consistent with real time: a read by one client may not reflect the most recently completed write by another. So, some applications may opt for the stronger semantics of linearizability. For linearizable reads, UpRight-ZooKeeper can still use the read-only optimization, but it must increase the read quorum size to n_exec − r_exec. To enforce linearizability, the original ZooKeeper issues a sync request through the agreement protocol and then issues a read to the same server, which ensures that the server has seen all updates that completed before the sync.

The last group of bars examines performance for a mix of 90% serializable reads and 10% writes. When UpRight-ZooKeeper is configured to tolerate r = 1 commission failures, its performance is over 66% of ZooKeeper’s. When it is configured to tolerate omission failures only, its performance is comparable to ZooKeeper’s.

Although the throughputs of our BFT configurations are comparable to those of the original CFT system, the extra guarantees come at a cost of resource consumption. Figure 7.9 shows that each request consumes significantly more CPU cycles under UpRight-ZooKeeper than under ZooKeeper. The graph shows per-request CPU consumption when both systems are heavily loaded; we observe similar results across a wide range of loads. We note that although using Java rather than C for agreement only modestly hurts our throughput for this application, it does significantly increase our resource consumption. Judging by peak throughputs on similar hardware, agreement protocols like PBFT and Zyzzyva may consume an order of magnitude fewer CPU cycles per request than our Zyzzyvark implementation. Future work is needed to see if a C realization of UpRight’s agreement protocol would provide a lower cost option for deployments willing to shift from Java to C.

Figure 7.9: Per-request CPU consumption for UpRight-ZooKeeper and ZooKeeper. The y axis is in jiffies. In our system, one jiffy is 4 ms of CPU consumption.

Figure 7.10 shows how throughput varies over time as nodes crash and recover. For this experiment we compare against ZooKeeper 3.1.1 because it fixes a bug in version 3.0.1’s log garbage collection that prevents this experiment from completing. The workload is a series of 1KB writes generated by 16 clients, and we compare ZooKeeper (u = 2 r = 0) with UpRight-ZooKeeper configured with u = 2+ r = 1. At times 30, 270, 510, 750, and 990 seconds we kill a single execution node and restart it 60 seconds later. At time 1230 seconds we kill all execution nodes and restart them 20 seconds later. Both systems successfully mask partial failures and recover quickly after a system-wide crash-recovery event.

Figure 7.10: Performance vs. time as machines crash and recover for ZooKeeper and UpRight-ZooKeeper.
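The read quorum sizes discussed above can be summarized with a small worked example. The sketch assumes the execution stage uses u_exec + r_exec + 1 replicas, which matches the configurations described in this section (e.g., u = 2, r = 1 yields 4 execution replicas); the class and method names are illustrative only.

// Illustrative arithmetic for execution-stage read quorums under the stated assumption.
class ReadQuorumSketch {
    static int executionReplicas(int uExec, int rExec)      { return uExec + rExec + 1; }
    static int serializableReadQuorum(int rExec)            { return rExec + 1; }
    static int linearizableReadQuorum(int uExec, int rExec) { return executionReplicas(uExec, rExec) - rExec; }

    public static void main(String[] args) {
        // For u = 2, r = 1: 4 execution replicas; serializable reads need 2 matching replies,
        // linearizable reads need replies from 3 replicas.
        System.out.println(executionReplicas(2, 1));      // 4
        System.out.println(serializableReadQuorum(1));    // 2
        System.out.println(linearizableReadQuorum(2, 1)); // 3
    }
}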
7.5 Conclusion and Discussion

In this chapter we relate our experience modifying HDFS and ZooKeeper to be compatible with the UpRight library. We take three lessons from our experience.

First, the changes required to make an existing (Java) application UpRight compliant are modest in scope and do not require extensive knowledge of (BFT) replication. In concrete terms, we modified approximately 2500 lines of code (out of 37,000) in HDFS (between the NameNode and the DataNode) and 600 lines of code (out of 13,500) in ZooKeeper. These modifications were made by junior students who did not know the details of the replication library. They reported that the ability to make the system fail-stop by setting u = 0 and r = 1 facilitated their development by highlighting the presence of non-determinism and aiding in identifying its source. We believe this is a step forward in comparison to previous replication libraries that are either integrated tightly into the application (e.g., ZooKeeper [108], Q/U [1], and Chubby [12]) or require extensive application modifications to fit a library-defined memory model (e.g., PBFT [18, 86], Zyzzyva [49], Aardvark [24], and others [50, 104, 107]).

Second, building a replication library to provide UpRight fault tolerance transforms the question of Byzantine or crash fault tolerance from a design decision to a configuration decision. With a single library, and a single application code base, we are able to provide Byzantine, crash, or hybrid fault tolerance. We believe this is an important step toward facilitating adoption of BFT replication in production applications.

Finally, the performance of UpRight applications can be competitive with the original code bases. Despite the fact that we made the conscious decision to keep the application modifications simple and to reuse application functionality when possible rather than optimizing the application and environment, we observe that the performance of an UpRight system is within 25% of the performance of the original system for most workloads.

Chapter 8

Background and state machine replication

There is a large body of research on fault tolerance and state machine replication. This thesis builds on much of that work and refines and incorporates ideas developed by a multitude of other researchers. Section 8.1 discusses the foundations of state machine replication. Section 8.2 discusses a variety of work on consensus and quorum systems that lies at the core of most RSM protocols. Section 8.3 discusses contemporary replication libraries developed as part of the effort to demonstrate that Byzantine fault tolerance and poor performance are not synonymous. Section 8.4 discusses previous work related to the performance of fault tolerant systems in the presence of failures. Section 8.5 discusses current commercial best practices for building reliable systems.

8.1 RSM approach

State machine replication is a powerful technique for building reliable services from faulty components [52, 88]. The basic idea behind state machine replication is simple: as long as every replica executes the same sequence of requests, correct replicas will provide the same responses to those requests and the collection of potentially faulty components can be viewed as a single correct node. There is a large body of previous work on the development of asynchronous replicated state machine (RSM) prototypes [1, 18, 24, 26, 49, 50, 92, 100, 104, 107] and deployed systems [12, 108] based on the Paxos RSM protocol [53]. The primary objective of RSM protocols is to ensure that the end-to-end service remains both safe, i.e., correct, and live, i.e., available, despite the failure of individual replicas.
The network connecting replicas in these systems is assumed to be asynchronous and consequently capable of arbitrarily delaying, reordering, or dropping messages. In asynchronous environments where nodes are allowed to fail, it is impossible to ensure that non-trivial systems will remain both safe and live [35]. RSM protocols are consequently designed to be fault tolerant. A protocol is fault tolerant if, despite a bounded number of failures, it is (1) safe always and (2) live provided that the network is sufficiently well behaved.

8.2 Consensus

The core unit of every RSM protocol is a consensus, or agreement, protocol. There is a large body of work on synchronous [79, 61, 27, 84, 64, 38, 47, 29, 59] and asynchronous [11, 15, 33, 65, 30, 13, 54, 56, 53, 54, 57, 58, 60, 69, 70, 68, 32, 35] consensus that establishes when it is possible to solve consensus and the number of replicas required. While the full body of previous work on consensus informs our design and implementation, the work by Lamport [56, 60], Dutta et al. [32], and Martin et al. [68] is especially important. These works explore circumstances in which the standard 3f + 1 replicas are not required to solve consensus and provide the foundation for the protocols we use to replicate the authentication, order, and execution stages of the UpRight architecture.

8.3 Recent RSM history

Our work builds on a number of previous asynchronous RSM prototypes [18, 42, 86, 1, 26, 45, 50, 49, 92, 104, 107]. Historically, BFT state machine replication was widely considered to be inefficient and fundamentally inappropriate for use in deployed systems. This belief held until Castro and Liskov provided a practical BFT NFS implementation [18]. Their protocol, PBFT, is based on a three-phase commit protocol that uses MACs, rather than digital signatures, for message authentication. Many subsequent BFT RSM systems [86, 26, 42, 50, 49, 92, 104, 107] are inspired by PBFT.

The systems directly inspired by PBFT can be broken down into three categories. Systems in the first set [26, 49, 92, 42] attempt to optimize throughput and latency by taking advantage of situations in which Byzantine consensus can be solved using two, rather than three, phases of commit [42, 49, 92] or without requiring any all-to-all communication steps [26, 42]. Systems in the second set [104, 107] reduce deployment costs by leveraging the disparity between the number of replicas required to agree on the order of requests and the number of replicas required to execute the requests. Systems in the third set optimize performance by facilitating parallel execution [50] or simplify development [86] through an object-based API. The UpRight architecture extends the separation of agreement and execution employed by the second set of systems, while the replicated order stage is based on techniques similar to those developed in the first set of systems. The work in the third set of systems is orthogonal to the UpRight library.

Another thread of previous work [45, 1] differentiates itself from the PBFT lineage by explicitly basing its replication protocols on quorums, rather than consensus.
While the protocols at the core of these systems do not share many obvious similarities with the systems in the PBFT lineage, careful consideration of generalized Paxos consensus [53] and the various special-case replication requirements for implementing consensus in two steps [32, 56, 68] indicates that the underlying quorum protocols are in fact very specific special cases of consensus. The protocols we use for the replicated authentication and execution stages are similar to the quorum protocols used in this lineage of work.

8.4 Performance with failures

We are not the first to notice significantly reduced performance for BFT protocols during periods of failures or bad network performance, or to explore how timing and failure assumptions impact the performance and liveness of fault tolerant systems. Singh et al. [95] show that PBFT [18], Q/U [1], HQ [26], and Zyzzyva [49] are all sensitive to network performance. They provide a thorough examination of the gracious executions of the four canonical systems using the ns2 [76] network simulator. Singh et al. explore performance properties when the participants are well behaved and the network is faulty; we focus our attention on the dual scenario where the participants are faulty and the network is well behaved. Aiyer et al. [4] and Amir et al. [7] note that a slow primary can result in dramatically reduced throughput. Aiyer et al. combat this problem by frequently rotating the primary. Amir et al. address the challenge instead by introducing a pre-agreement protocol requiring several all-to-all message exchanges and using signatures for all authentication. Condie et al. [25] address the ability of a well-placed adversary to disrupt the performance of an overlay network by frequently restructuring the overlay, effectively changing its view. The signature processing and scheduling of replica messages in Aardvark is similar in flavor to the early rejection techniques employed by the LOCKSS system [40, 66] in order to improve performance and limit the damage an adversary can inflict on the system.

8.5 Application fault tolerance

Commercial best practices for replication have evolved towards increasing tolerance to fail-stop faults as hardware costs fall, as replication techniques become better understood and easier to adopt, and as systems become larger, more complex, and more important. For example, once it was typical for storage systems to recover from media failures using off-line backups; then single-parity or mirrored RAID [20] became de rigueur; now, there appears to be increasingly routine use of doubly-redundant storage [39, 90, 81]. Similarly, although two-phase commit is often good enough—in the absence of commission failures it can be always safe and rarely unlive—increasing numbers of deployments pay the extra cost to use Paxos [53, 77] three-phase commit [12, 99] to simplify their design or avoid corner cases requiring operator intervention [12].

Failed processes and hardware are not always polite enough to stop cleanly. Instead, they may continue to operate and provide incorrect outputs or corrupt internal state for a variety of reasons including bad NICs [2], soft CPU errors [94], memory errors [19], disk errors [82, 90, 81], and software Heisenbugs [105]. Deployed systems increasingly include limited Byzantine fault tolerance aimed at high-risk subsystems. For example, the ZFS [85], GFS [39], and HDFS [43] file systems provide checksums for on-disk data [82].
As another example, after Amazon S3 was felled for several hours by a flipped bit, additional checksums on system state messages were added [97]. Although it may be cheaper to check for and correct faults at critical points than to do so end-to-end, we fear that it may be difficult to identify all significant vulnerabilities a priori and complex to solve them case by case with ad hoc techniques. We demonstrate that end-to-end techniques can be applied to existing applications.

Chapter 9

Conclusion

This thesis describes the design, implementation, and deployment of the UpRight replication library. More importantly, it presents a concrete step towards making Byzantine fault tolerance a deployable option for general computing systems. We believe that the UpRight library eases the path to adopting Byzantine fault tolerance in two important ways. First, the UpRight library provides both crash and Byzantine fault tolerance in a single code base. Flexible fault tolerance encourages incremental adoption of Byzantine fault tolerance by removing the need to maintain multiple code bases and allowing sysadmins to “add” Byzantine fault tolerance to an existing system by adding additional resources and changing a configuration parameter rather than deploying and supporting an entirely different system. Second, the application interface provided by the UpRight library is not onerous; our experience indicates that programmers unfamiliar with the details of the replication library can port legacy applications with only nominal effort.

In addition to the practical benefits mentioned above, this thesis makes three important conceptual contributions that improve the understanding of fault tolerance and state machine replication.

First, we refine the definition of fault tolerance to more accurately reflect the needs of deployed systems. Our refinement comes in two parts. First, we reject the traditional dichotomy between crash and Byzantine fault tolerance and instead embrace the UpRight failure model (Chapter 2). Embracing the UpRight model allows system developers to ask “Do I want fault tolerance or not?” rather than decide in advance whether Byzantine or crash fault tolerance is appropriate for the deployment environment. Second, we reject the exclusive focus on best-case performance and observe that fault tolerant systems should provide good performance even when failures occur (Chapter 3).

Second, we clarify the definition of state machine replication. We refine the responsibilities of the replication library and the application (Chapter 4) and revisit the key functional pieces of state machine replication (Chapter 5). With respect to the responsibilities of the library and the application, we emphasize that the library is responsible for delivering batches of requests to application replicas in a single order. The application replicas are in turn responsible for executing those batches deterministically and providing, on demand, deterministic checkpoints of their state. With respect to the functional pieces of state machine replication, we observe that request authentication must be added to the traditional steps of order, agree, and execute (note that order and agree are frequently merged into a single step).

Third, we clarify the design of replication protocols around variations of consensus (Chapter 6). By mapping the interactions between nodes in the system to a consensus problem, we are able to better understand the requirements of each component of the system and leverage the existing body of work on consensus.
Moving forward, the recognition that state machine replication can be described as a collection of consensus protocols should make it easier to understand new and existing protocols and also highlights the fundamental differences between systems. While we have made it easier to understand and deploy Byzantine fault tolerant systems, there are still significant barriers to widespread adoption. Chief among these barriers is the common belief that “Byzantine failures just don’t happen.” If true, this implies that Byzantine fault tolerant systems are a luxury that is not needed in a general computing environment. The next round of Byzantine fault tolerant systems research consequently should focus on deployment, failure tracking, and failure analysis. The key questions to answer are (a) what fraction of failures can be masked by BFT techniques and not CFT techniques and (b) what is the real impact of these failures.

Appendix A

UpRight Library Byte Specifications

This appendix provides the full byte definition for all data structures that are sent across the network or placed on disk in the UpRight library: the inter-stage messages, order stage checkpoints, execution stage checkpoints, and intra-execution stage messages. We do not include the byte specification for intra-order stage messages.

A.1 Basic Message Structure

All messages in the UpRight library conform to the basic structure shown in Figure A.1. (The fields in Figure A.1 and all other figures in this chapter are presented in the order they appear; the sizes of the fields in the figures do not correlate with the byte sizes of the implementations.) Every message contains (a) a 2 byte message tag, (b) a 4 byte payload size, (c) a payload of the specified size, and (d) a block of bytes dedicated to authentication, as shown in Figure A.1. We implement three distinct authentication strategies: (1) simple MAC authentication, (2) MAC authenticator authentication, and (3) matrix signature [3] authentication. We use MD5 for digests/hashes and SHA1 for MAC authentication. In our current implementation, an individual MAC is 20 bytes and a digest is 16 bytes. For subsequent message definitions we will indicate which of the authentication types are being used and describe the byte specification for the payload of the specific message.

Figure A.1: Messages are built upon a verified message base. This basic byte structure contains 4 fields: tag, payload size, payload, and authentication block.

Simple MAC authentication. A MAC is a shared private key between a pair of nodes. Authenticating a MAC ensures that one of the nodes sharing that key generated the message. Messages authenticated with a MAC follow the structure shown in Figure A.2. The authentication block of MAC messages contains a 4 byte sender field and a 20 byte MAC. The MAC is computed over the tag, payload size, payload, and sender fields of the message.

MAC authenticator authentication. A MAC authenticator [18] is an array of MACs designed to provide authentication to multiple recipients. The byte layout of a MAC authenticator message is shown in Figure A.3. The authentication block of a MAC authenticator message consists of (a) a 4 byte sender field, (b) a 16 byte digest of the tag, payload size, payload, and sender fields, and (c) one 20 byte MAC per recipient.
The digest is computed over the tag, payload size, payload and sender fields. For efficiency, the MACs are computed over the digest. 183 Tag Payload Size Payload sender MAC Figure A.2: Basic byte structure of a message with simple MAC authentication. Tag Payload Size Payload sender Digest MAC MAC MAC MAC Figure A.3: Byte definition for a message authenticated with a MAC array. The sender is the replica responsible for generating the MACs, the Digest field is a digest of the tag, payload size, and sender fields. The MACs are generated using the byte representation of the digest rather than the full message. 184 Tag Payload Size Payload MAC, s1->r0 MAC, s1 -> r1 Macs generated by sender 1 MAC, s1 -> r2 MAC, s1 -> r3 MAC, s2->r0 MAC, s2 -> r1 Macs generated by sender 2 MAC, s2 -> r2 MAC, s2 -> r3 MAC, s3->r0 MAC, s3 -> r1 Macs generated by sender 3 MAC, s3 -> r2 MAC, s3 -> r3 Figure A.4: Message authenticated with a matrix signature. The authentiation block of these messages consists of a collection of MAC Arrays that each authenticate the tag, size and payload. Matrix signature authentication Matrix signatures [3] are a technique that provide the strong properties of digital signatures (specifically forwardability) at the lower costs afforded by MACs. A matrix signature consists of a collection of MAC authenticators from muliple senders. A recipient considers a matrix signature valid if it can authenticate a threshold of th MAC authenticators. The byte layout of a matrix signature message is shown in Figure A.4. The authentication block of a matrix signature message consists of (a) a 16 byte digest of the tag, payload size, and payload fields, (b) followed by k MAC authenticators with |recipient set| 20 byte MACs each. For efficiency, the individual MACs are computed over a digest of the tag, payload size, and payload fields of the message. 185 Message Tag 1 (regular) hclient-req, hreq-core, c, nc , opi, ciµ~ c,F 16 (read only) hauth-req, hreq-core, c, nc , hash(op)iµ~ f,O , f iµ~ f,O 19 hcommand, no , c, nc , op, f iµf,e 22 htoCache, c, nc , op, f iµ~ f,E 25 11 (speculative) hnext-batch, v, no , H, B, t, bool, oiµ~ o,E 12 (tentative) 13 (committed) hrequest-cp, no , oiµ~ o,E 10 hretransmit, c, no , oiµ~ o,E 4 hload-cp, Tcp , no , oiµo,e 5 hbatch-complete, v, no , C, eiµ~ e,F 20 hfetch, no , c, nc , hash(op), eiµ~ e,F 21 hcp-up, no , C, eiµ~ e,F 24 hlast-exec, ne , eiµ~ e,O 6 hcp-token, no , Tcp , eiµ~ e,O 7 hcp-loaded, no , eiµ~ e,O 14 8 (regular) hreply, nc , R, H, e, iµe,c 15 (watch) 17 (readonly) Table A.1: Message Tags for all intra-node messages. A.2 Inter-stage messages This section defines the byte specification and message tags for all messages exchanged betweeen clients, filter, order, and execution nodes. Details on messages that are internal to the execution stage can be found in Sections A.3. A.2.1 Message Tags The specific message tags used for all inter-node messages are shown in Table A.1. A.2.2 Inter-stage messages This section defines the payload structure for all messages that pertain directly to client requests. 186 Command/ Digest Flag 1 byte Client ID 4 bytes Request ID 4 bytes Command size 4 bytes Command command size bytes Figure A.5: Byte Specification of the Entry at the core of every request. Entry. All messages which contain a request in their payload are built around a common entry data structure shown in Figure A.5. 
An entry consists of five fields: (1) a 1 byte flag indicating if the entry contains a command or a digest of a command, (2) a 4 byte identifier of the client that issued the command, (3) a 4 byte request identifier, (4) the size, in bytes, of the command (or command digest), and (5) the command (or digest) itself. Client Requests. A hclient-req, hreq-core, c, nc , opi, ciµ~ c,F message relies on the MAC authenticator byte layout. The payload of the message is an entry shown in Figure A.5. Filtered Requests. A hauth-req, hreq-core, c, nc , hash(op)iµ~ f,O , f iµf,o mes- sage relies on the MAC authenticator byte layout. The implementation of each filtered request message contains one or more requests that have been individually validated by the sending filter replica. The payload of a filtered request message is a 2 byte integer k followed by k authenticated entries. An authenticated entry is a message authenticated by a matrix signature whose payload is an entry. The payload of a filtered request message is shown in Figure A.6. Forward Requests. A hcommand, no , c, nc , op, f iµf,e message relies on MAC au- thentication. The payload of a forward request message is a 4 byte sequence number followed by a request entry. The entry in a forward request is always a command 187 Number of entries Authenticated Entry ... Authenticated Entry Figure A.6: Byte Specification of the payload of a hauth-req, hreq-core, c, nc , hash(op)iµ~ f,O , f iµf,o message. Sequence Number Entry Figure A.7: Byte Specification of the payload of a hcommand, no , c, nc , op, f iµf,e message. itself and not a digest. The byte format is shown in Figure A.7 Speculatively Forwarded Requests. A htoCache, c, nc , op, f iµ~ f,E messages uses the MAC authenticator authentication byte layout. The payload of this message is a request entry shown in Figure A.5. Next Batch. A hnext-batch, v, no , H, B, t, bool, oiµ~ o,E message uses the MAC authenticator byte layout. The payload of the message is shown in Figure A.8 and consists of 9 fields: (1) a 4 byte view number, (2) a 4 byte sequence number, (3) a 16 byte history digest, (4) a 16 byte checkpoint digest, (5) a 2 byte boolean, (6) a 2 byte integer for the byte size of encoding non-determinism and time, (7) encoded nondeterminism and time, (8) a 2 byte integer representing the number of commands 188 View Number Sequence Number History Digest Checkpoint Digest Take CP NonDetSize Non-determinism Batch Size (bytes) Number of Entries Entries Figure A.8: Byte Specification of a hnext-batch, v, no , H, B, t, bool, oiµ~ o,E message in the batch, and (9) an entry per command in the bath. Non-determinism is encoded as a pair of 8 byte numbers corresponding to time and a seed for a pseudo random number generator as shown in Figure A.9. There are three different types of NextBatch messages corresponding to the level of agreement achieved by the order node: speculative, tentative, and commmitted. Replies A hreply, nc , R, H, e, iµe,c message is based on the simple MAC authenti- cation byte layout. The payload of a reply consists of (a) a 4 byte sequence number, (b) a 4 byte encoding of the size of the reply, (c) and the reply itself as shown in Figure A.10. 189 Time Random seed Figure A.9: Byte encoding of non-determinism. The two fields correspond to time and a seed for random number generation. Request ID Response size Response Figure A.10: Byte Specification of the hreply, nc , R, H, e, iµe,c message. Checkpoint request. 
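As an illustration of the entry layout of Figure A.5, the sketch below serializes an entry with Java’s ByteBuffer. The field widths follow the text (a 1 byte flag, a 4 byte client identifier, a 4 byte request identifier, a 4 byte command size, and the command bytes); the class name and the default big-endian byte order are assumptions rather than part of the specification.

import java.nio.ByteBuffer;

// Illustrative serialization of the entry of Figure A.5; not the library's actual code.
class EntrySketch {
    static byte[] serialize(boolean isDigest, int clientId, int requestId, byte[] command) {
        ByteBuffer buf = ByteBuffer.allocate(1 + 4 + 4 + 4 + command.length);
        buf.put((byte) (isDigest ? 1 : 0)); // command/digest flag
        buf.putInt(clientId);               // client that issued the command
        buf.putInt(requestId);              // request identifier
        buf.putInt(command.length);         // size of the command (or command digest)
        buf.put(command);                   // the command (or digest) itself
        return buf.array();
    }
}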
A hrequest-cp, no , oiµ~ o,E message relies on the MAC au- thenticator byte layout. The payload of a checkpoint request consists of a 4 byte sequence number for the checkpoint being requested as shown in Figure A.11. Checkpoint release. A hrelease-cp, Tcp , no , oiµ~ o,E message relies on the MAC authenticator byte layout. The payload of a checkpoint release message consists of (a) a four byte sequence number of the checkpoint to be released, (b) a four byte length of the checkpoint token, and (c) the checkpoint token itself as shown in Figure A.12. Retransmit. A hretransmit, c, no , oiµ~ o,E message relies on the MAC authenti- cator byte layout. The payload of a retransmit message consists of (a) a 4 byte Sequence Number Figure A.11: Byte Specification of the payload for a hrequest-cp, no , oiµ~ o,E message. 190 Sequence Number Token size Token Data Figure A.12: Byte Specification of the payload for a hrelease-cp, Tcp , no , oiµ~ o,E message. Client ID Figure A.13: Byte Specification of the payload for a hretransmit, c, o, µ ~ o,E im essage. client identifier and (b) a 4 byte batch identifier as shown in Figure A.13. Load checkpoint. A hload-cp, Tcp , no , oiµo,e message uses the simple MAC au- thentication byte layout. The fields of a load checkpoint message are (a) a 4 byte sequence number, (b) a 4 byte length of a checkpoint descriptor, and (c) the checkpoint to be loaded as shown in Figure A.14. Batch Completed. A hbatch-complete, v, no , C, eiµ~ e,F message relies on the MAC authenticator byte layout. The payload consists of (a) a 4 byte view number, Sequence Number Token Size Exec CP token Figure A.14: Byte Specification of the payload for a hload-cp, Tcp , no , oiµo,e message. 191 View Number SequenceNumber Batch size Number of entries Entry ... Entry Figure A.15: Byte specification of a hbatch-complete, v, no , C, eiµ~ e,F message. (b) a 4 byte sequence number, (c) a 4 byte count of the subsequent bytes in the payload, (d) a 2 byte count k of the number of contained entries, and (e) k entries as shown in Figure A.15. The byte layout of each entry is shown in Figure A.5. Fetch Command. A hfetch, no , c, nc , hash(op), eiµ~ e,F message relies on the MAC authenticator byte layout. The payload consists of a 4 byte sequence number and an entry as shown in Figure A.16. The byte layout of the entry is shown in Figure A.5. Sequence Number Entry Figure A.16: Byte specification of a hfetch, no , c, nc , hash(op), eiµ~ e,F message. 192 Sequence Number Last executed Total Clients in System ... Last executed Figure A.17: Byte specification of a hcp-up, no , C, eiµ~ e,F message. Sequence Number Figure A.18: Byte Specification hcp-loaded, no , eiµ~ e,O messages. Checkpoint Update. of hlast-exec, ne , eiµ~ e,O and A hcp-up, no , C, eiµ~ e,F message relies on the MAC authen- ticator byte layout. The payload consists of (a) a 4 byte sequence number and (b) the 4 byte identifier of the most recent request executed for each client as shown in Figure A.17. The checkpoint update message is used in place of a batch completed message when the execution stage processes retransmission instructions. Last executed. A hlast-exec, ne , eiµ~ e,O message relies on the MAC authentica- tor byte layout. The payload of a last executed message is a 4 byte sequence number shown in Figure A.18. Checkpoint loaded. A hcp-loaded, no , eiµ~ e,O uses the MAC authenticator byte layout. The payload consist of a 4 byte sequence number shown in Figure A.18. 
Note that the checkpoint loaded message is used as an efficient replacement for a last executed message. The handling of the two messages is identical, except that instructions to load a checkpoint are never sent in response to a checkpoint loaded notification. Checkpoint message. A hcp-token, no , Tcp , eiµ~ e,O message relies on the MAC authenticator byte layout. The payload of the message consists of (a) a 4 byte sequence number, (b) the size of the checkpoint, and (c) the checkpoint as shown in 193 Sequence Number Token Size Exec CP token Figure A.19: Byte specification for the payload of a hcp-token, no , Tcp , eiµ~ e,O message. Message Tag hfetch-exec-cp, n, eiµ~ e,E 70 hexec-cp-state, n, S, eiµe,e′ 71 hfetch-state, Tstate , eiµ~ e,E 72 hstate, Tstate , S, eiµe,e′ 73 Table A.2: Set of messages for intra-node communication Figure A.19. A.2.3 Order stage checkpoint During normal operation, the order stage periodically records checkpoints to disk. The contents of a checkpoint are shown in Figure A.20. The basic layout of bytes when serializing an order checkpoint is shown in Figure A.21. The serialization consists of (a) a 16 bytes history digest, (b) an 8 byte time, (c) a 4 byte sequence number, (d) a 2 byte count of the number of clients k in the system, (e) k pairs of 4 byte request identifiers and 4 byte sequence numbers, and (f) the execution checkpoint. A.3 A.3.1 Execution node specifications Message Tags The message tags used for all intra-execution messages are shown in Table A.2. 194 lastOrdered Next Batch Identifier History Time Client Request Batch ID ID Execution Checkpoint Token Figure A.20: Order node checkpoint. History Time Sequence Number Base Sequence Number Number of Clients Request ID Seq. Number ... Request ID Seq. Number Execution Checkpoint Figure A.21: Order node checkpoint byte specification. 195 replyCache Next Batch Identifier Client Reply Application Checkpoint Figure A.22: Exec node checkpoint. A.3.2 Execution checkpoints During normal operation, order nodes record checkpoints to disk. The contents of a checkpoint are shown in Figure A.22. The basic layout of bytes when serializing an order checkpoint is shown in Figure A.23. The serialization consists of (a) a 4 byte base sequence number, (b) a 4 byte current sequence number, (c) a 4 byte max sequence number, (d) a 4 byte size of the application checkpoint, (e) the application checkpoint, and (f) for each client, (i) a 4 byte sequence number, (ii) a 4 byte request id, (iii) a 4 byte response length k, and (iv) a reply message. The serialization is shown in Figure A.23. A.3.3 Execution Messages Fetch checkpoint. A hfetch-exec-cp, n, eiµ~ e,E message relies on the MAC au- thenticator byte layout. The paylod consists of a 4 byte sequene number shown in Figure A.24. Execution Checkpoint. A hexec-cp-state, n, S, eiµe,e′ message uses the basic MAC authentication byte layout. The payload consists of a 4 byte sequence number, a 4 byte checkpoint size, and the checkpoint from Figure A.23. The byte layout of 196 Base Sequence Number Current Sequence Number Max Sequence Number App Checkpoint Size App Checkpoint Sequence Number Request ID Response Size Response ... Sequence Number Request ID Response Size Response Figure A.23: Order node checkpoint byte specification. Sequence Number Figure A.24: Byte Specification of the payload of a hfetch-exec-cp, n, eiµ~ e,E message. 197 Sequence Number Checkpoint size Checkpoint Figure A.25: Byte Specification of the payload of a hexec-cp-state, n, S, eiµe,e′ message. 
the payload is shown in Figure A.25.

Fetch State. A hfetch-state, Tstate , eiµ~ e,E message uses the MAC authenticator byte layout. The payload consists of a 2 byte token size and a token, as shown in Figure A.26.

Figure A.26: Byte Specification of the payload of a hfetch-state, Tstate , eiµ~ e,E message.

Send State. A hstate, Tstate , S, eiµe,e′ message relies on the simple MAC authentication byte layout. The payload consists of a 2 byte token size, the token, a 4 byte state size, and then the state, as shown in Figure A.27.

Figure A.27: Byte Specification of the payload of a hstate, Tstate , S, eiµe,e′ message.

Appendix B

UpRight Library API

This appendix describes the Java API between the UpRight library and replicated applications. Section B.1 describes the application client-library client API. Section B.2 describes the application server-library server API.

B.1 Client API

Figure B.1 depicts the four function calls provided by the UpRight library to application clients. The function calls provide synchronous (execute) and asynchronous (enqueue) calls instructing the library to execute general or read-only requests. Each call takes a byte array representation of the application request as a parameter. Figure B.2 depicts the functions that the application client should implement. All three functions are optional and are not required to support the basic functionality of synchronous request execution. The function brokenConnection is used to signal the application when a network error occurs and is important for applications that explicitly rely on TCP connections to maintain sessions. The function returnReply is used in conjunction with asynchronous request execution and server-initiated communication. The function canonicalEntry allows the application to select a canonical response (if it exists) from a quorum of responses that are semantically equivalent but based on different byte representations.

/** Returns the result of executing operation **/
public byte[] execute(byte[] operation);

/** Returns the result of executing read only request operation through the normal execution path **/
public byte[] executeReadOnlyRequest(byte[] operation);

/** Enqueues a read only request for asynchronous execution **/
public void enqueueReadOnlyRequest(byte[] op);

/** Enqueues a regular request for asynchronous execution **/
public void enqueueRequest(byte[] operation);

Figure B.1: Interface exported by the UpRight library to the application client.

/** Function called when the connection between the client and the server is determined to be broken. **/
public void brokenConnection();

/** Returns a reply received from the server **/
public void returnReply(byte[] reply);

/** Considers the set of possible replies options. Returns a canonical version of those replies if it exists, returns null otherwise **/
public byte[] canonicalEntry(byte[][] options);

Figure B.2: Interface implemented by the application client.

B.2 Server API

Figure B.3 shows the six functions an UpRight application must implement.
B.2 Server API

Figure B.3 shows the six functions an UpRight application must implement. The first two functions, exec and execReadOnly, are used to execute ordered batches of requests and read only requests, respectively. The application calls result once per executed request. The application is expected to execute batches in the order they are received, using the time and PRNG in the nondeterminism field. The final parameter of exec is a boolean takeCP. If takeCP is true, then the application is expected to take a checkpoint immediately after executing the current batch and before processing any requests contained in subsequent batches.

The second two functions, loadCP and releaseCP, are required for managing application checkpoints. The function loadCP takes a byte array containing either the application checkpoint itself or a description that the application can use to map to a specific checkpoint, and instructs the application to load the specified checkpoint. The function releaseCP is an optional call that should be used if the application provides a descriptor of its checkpoint rather than the checkpoint itself. releaseCP allows the application to manage checkpoints internally and potentially rely on incremental checkpoints for efficiency.

The final two functions, fetchState and loadState, are used in the case that the application reports tokens describing a checkpoint to the library. These functions are used to transfer small portions of an application checkpoint between replicas and allow for incremental state transfer when checkpoints are loaded.

Figure B.4 shows the five functions exported by the UpRight library to the application server. The function result is called whenever the application finishes processing a request; it indicates the result of that computation, the clientId of the client that issued the request, the reqId associated with the request, and a boolean toCache indicating whether the response is a regular response (true) that should be stored in the reply cache. The function readOnlyResult serves the same purpose, but is used only for responses to read only requests. The function returnCP is called by the application when it finishes generating a checkpoint following the execution of batch seqNo. The checkpoint is described by the byte array AppCPToken; the format of AppCPToken is dictated by the application.

/********************** Request Execution **********************/

/** Execute the commands in batch with associated order sequence
    number seqNo and using time for any non-determinism.
    Following each command in batch, shim.result() is called. **/
public void exec(CommandBatch batch, long seqNo, NonDeterminism time, boolean takeCP);

/** Execute operation as a read only request.
    Following execution of operation, shim.readOnlyResult() is called. **/
public void execReadOnly(int clientId, long reqId, byte[] operation);

/********************** Checkpoint Management **********************/

/** Load the application checkpoint indicated by cpToken.
    Returns true if the checkpoint is successfully loaded,
    returns false otherwise.

    When loadCP returns, it indicates that any requests executed as
    part of a preceding call to exec() or execReadOnly() that have not
    already generated a response will not generate a future response. **/
public void loadCP(byte[] appCPToken, long seqNo);

/** Release the application checkpoint described by appCPToken. **/
public void releaseCP(byte[] appCPToken);

/********************** State Transfer **********************/

/** Fetch the state described by stateToken. **/
public void fetchState(byte[] stateToken);

/** Load the state that is described by stateToken. **/
public void loadState(byte[] stateToken, byte[] state);

Figure B.3: Interface implemented by the application server and called by the UpRight library. The six functions can be considered as three pairs of common functionality: (a) request execution, (b) checkpoint management, and (c) state transfer.
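To make the server-side contract concrete, the following is a minimal sketch of an application implementing the request-execution and checkpoint portions of Figure B.3 for a trivial counter service. It is illustrative only: the nested Command, CommandBatch, NonDeterminism, and Shim shapes are simplified stand-ins for the real UpRight types, which expose richer structure.

// Illustrative application server for the Figure B.3 interface.
// The nested types below are simplified placeholders, not the UpRight classes.
import java.nio.ByteBuffer;

class CounterServer {
    static class Command { int clientId; long reqId; byte[] operation; }
    static class CommandBatch { Command[] commands; }
    static class NonDeterminism { long time; long prngSeed; }
    interface Shim {   // subset of the Figure B.4 up-calls
        void result(byte[] result, int clientId, long reqId, long seqNo, boolean toCache);
        void returnCP(byte[] appCPToken, long seqNo);
    }

    private final Shim shim;
    private long counter = 0;     // the entire application state

    CounterServer(Shim shim) { this.shim = shim; }

    // Figure B.3 exec: execute an ordered batch, reporting one result per command.
    public void exec(CommandBatch batch, long seqNo, NonDeterminism time, boolean takeCP) {
        for (Command c : batch.commands) {
            counter++;                                        // deterministic state update
            byte[] reply = ByteBuffer.allocate(8).putLong(counter).array();
            shim.result(reply, c.clientId, c.reqId, seqNo, true);
        }
        if (takeCP) {
            // Checkpoint immediately after this batch, before any later batch.
            byte[] cp = ByteBuffer.allocate(8).putLong(counter).array();
            shim.returnCP(cp, seqNo);
        }
    }

    // Figure B.3 loadCP: restore state from a previously produced checkpoint.
    public void loadCP(byte[] appCPToken, long seqNo) {
        counter = ByteBuffer.wrap(appCPToken).getLong();
    }
}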
/********************** Request Execution **********************/

/** Upcall that delivers the result of executing clientId's reqId^th
    request at position seqNo in the sequence to the shim. **/
public void result(byte[] result, int clientId, long reqId, long seqNo, boolean toCache);

/** Upcall delivering the result of executing clientId's reqId^th
    read only request. **/
public void readOnlyResult(byte[] result, int clientId, long reqId);

/********************** Checkpoint Management **********************/

/** Upcall delivering the application checkpoint token cpToken taken
    at batch number seqNo to the shim. **/
public void returnCP(byte[] AppCPToken, long seqNo);

/********************** State Transfer **********************/

/** Upcall delivering the application state corresponding to
    stateToken to the shim. **/
public void returnState(byte[] stateToken, byte[] state);

/** Upcall requesting the application state described by stateToken
    from the shim. **/
public void requestState(byte[] stateToken);

Figure B.4: Interface exported by the UpRight library to the application server as call-backs. The functions can be considered in groups based on common functionality: (a) response processing, (b) checkpoint management, (c) state transfer, and (d) generic management.

The functions returnState and requestState facilitate the transfer of application state between execution replicas. The application is expected to call returnState as part of processing a fetchState command.
The application is expected to call requestState if it does not have a copy of the state required to successfully load a checkpoint.
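The interplay between fetchState/loadState (Figure B.3) and returnState/requestState (Figure B.4) can be summarized in a short sketch of the expected call pattern. It is a sketch under assumptions: the StateTransferExample class, its token-keyed store, and the nested Shim interface are hypothetical, introduced only to illustrate the flow described above.

// Illustrative call pattern for incremental state transfer.
// StateTransferExample and the nested Shim are placeholders.
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

class StateTransferExample {
    interface Shim {                       // subset of Figure B.4 up-calls
        void returnState(byte[] stateToken, byte[] state);
        void requestState(byte[] stateToken);
    }

    private final Shim shim;
    private final Map<String, byte[]> store = new HashMap<>();  // token -> state

    StateTransferExample(Shim shim) { this.shim = shim; }

    private String key(byte[] token) {
        return Base64.getEncoder().encodeToString(token);
    }

    // Figure B.3 fetchState: another replica asked for the state named by this
    // token; hand it back to the shim via returnState.
    public void fetchState(byte[] stateToken) {
        byte[] state = store.get(key(stateToken));
        if (state != null) {
            shim.returnState(stateToken, state);
        }
    }

    // Called while loading a checkpoint: if a needed piece of state is missing
    // locally, ask the shim to retrieve it from another execution replica.
    private void ensureState(byte[] stateToken) {
        if (!store.containsKey(key(stateToken))) {
            shim.requestState(stateToken);
        }
    }

    // Figure B.3 loadState: the shim delivers state previously requested.
    public void loadState(byte[] stateToken, byte[] state) {
        store.put(key(stateToken), state);
    }
}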
Vita

Allen Clement was born in Alexandria, Virginia, as the son of two lawyers. His family moved to Houston, Texas, when he was seven, and he lived there until he graduated from Strake Jesuit College Preparatory in 1996. He attended Princeton University in Princeton, New Jersey, where he graduated with an A.B. in Computer Science in 2000. He taught introductory Java programming at Ngee Ann Polytechnic in Singapore from July 2000 through June 2001. He spent fall 2001 through May 2002 studying computational geometry and hypercube embedding at the University of British Columbia in Vancouver, British Columbia. In fall 2002 he enrolled in the PhD program in the Department of Computer Sciences at the University of Texas at Austin, where he was a teaching assistant and graduate research assistant.

Permanent Address: 1902 Coulcrest, Houston, Texas 77055

This dissertation was typeset with LaTeX 2ε by the author. LaTeX 2ε is an extension of LaTeX. LaTeX is a collection of macros for TeX. TeX is a trademark of the American Mathematical Society. The macros used in formatting this dissertation were written by Dinesh Das, Department of Computer Sciences, The University of Texas at Austin, and extended by Bert Kay, James A. Bednar, and Ayman El-Khashab.