System Design on AWS Excerpt


System Design

on AWS
Building and Scaling Enterprise Solutions

Early
Release
Raw & Unedited
Compliments of

Jayanth Kumar &


Mandeep Singh
Dragonfly Cloud


System Design on AWS
Building and Scaling Enterprise Solutions

With Early Release ebooks, you get books in their earliest


form—the authors’ raw and unedited content as they write—
so you can take advantage of these technologies long before
the official release of these titles.

Jayanth Kumar and Mandeep Singh

Beijing Boston Farnham Sebastopol Tokyo


System Design on AWS
by Jayanth Kumar and Mandeep Singh
Copyright © 2024 Jayanth Kumar and Mandeep Singh. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are
also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional
sales department: 800-998-9938 or [email protected].

Acquisitions Editor: Megan Laddusaw
Development Editor: Melissa Potter
Production Editor: Katherine Tozer
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

October 2024: First Edition

Revision History for the Early Release


2023-06-27: First Release
2023-08-31: Second Release
2023-10-09: Third Release
2023-12-07: Fourth Release
2024-02-02: Fifth Release

See http://oreilly.com/catalog/errata.csp?isbn=9781098146894 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. System Design on AWS, the cover
image, and related trade dress are trademarks of O’Reilly Media, Inc.
The views expressed in this work are those of the authors and do not represent the publisher’s views.
While the publisher and the authors have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility
for errors or omissions, including without limitation responsibility for damages resulting from the use
of or reliance on this work. Use of the information and instructions contained in this work is at your
own risk. If any code samples or other technology this work contains or describes is subject to open
source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use
thereof complies with such licenses and/or rights. This work is part of a collaboration between O'Reilly
and Dragonfly. See our statement of editorial independence.

978-1-098-14683-2
[FILL IN]
Table of Contents

1. System Design Trade-offs and Guidelines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5


System Design Concepts 6
Communication 6
Consistency 8
Availability 13
Reliability 19
Scalability 20
Maintainability 22
Fault Tolerance 23
Fallacies of Distributed Computing 24
System Design Trade-offs 26
Time vs Space 27
Latency vs Throughput 27
Performance vs Scalability 28
Consistency vs Availability 29
System Design Guidelines 32
Guideline of Isolation: Build It Modularly 32
Guideline of Simplicity: Keep it Simple, Silly 32
Guideline of Performance: Metrics Don’t Lie 33
Guideline of Tradeoffs: There Is No Such Thing As A Free Lunch 34
Guideline of Use Cases: It Always Depends 34
Conclusion 35

4. Caching Policies and Strategies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37


Caching Benefits 38
Cache Eviction Policies 39
Belady’s Algorithm 39
Allowlist Policy 41

Cache Invalidation 41
Caching Strategies 43
Caching Deployment 45
Caching Mechanisms 46
Content Delivery Networks 48
Open Source Caching Solutions 51
Memcached 51
Redis 52
Conclusion 57

CHAPTER 1
System Design Trade-offs and Guidelines

A Note for Early Release Readers


With Early Release ebooks, you get books in their earliest form—the authors’ raw and
unedited content as they write—so you can take advantage of these technologies long
before the official release of these titles.
This will be the 1st chapter of the final book. Please note that the GitHub repo will be
made active later on.
If you have comments about how we might improve the content and/or examples in
this book, or if you notice missing material within this chapter, please reach out to the
editor at [email protected].

Today’s modern technological revolution is happening because of large scale software
systems. Big enterprise companies like Google, Amazon, Oracle, and SAP have all
built large scale software systems to run their (and their customers’) businesses.
Building and operating such large scale software systems requires first principles
thinking to design and develop the technical architecture before actually putting
the system into code. This is because we don’t want to be in a state where these
systems will not work/scale after writing 10k lines of code. If the design is right in
the first place, the rest of the implementation journey becomes smooth. This requires
looking at the business requirements, understanding the needs and objectives of the
customer, evaluating different trade-offs, thinking about error handling and edge
cases, contemplating futuristic changes and robustness while worrying about basic
details like algorithms and data structures. Enterprises can avoid the mistake of
wasted software development effort by carefully thinking about systems and investing
time in understanding bottlenecks, system requirements, users being targeted, user
access patterns and many such decisions, which in short, is System Design.

This chapter covers the basics of system design, with the goal of helping you to
understand the concepts around system design itself, the system trade-offs that nat‐
urally arise in such large-scale software systems, the fallacies to avoid in building
such large scale systems and the guidelines—those wisdoms which were learnt after
building such large scale software systems over the years. This is simply meant to
introduce you to the basics—we’ll dig into details in later chapters, but we want you
to have a good foundation to start with. Let’s begin by digging into the basic system
design concepts.

System Design Concepts


To understand the building blocks of system design, we should understand the
fundamental concepts around systems. We can leverage abstraction here, the concept
in computer science of obfuscating the inner details to create a model of these system
design concepts, which can help us to understand the bigger picture. The concepts
in system design, be it any software system, revolve around communication, consis‐
tency, availability, reliability, scalability, fault tolerance and system maintainability.
We will go over each of these concepts in detail, creating a mental model while also
understanding how their nuances are applied in large scale system design.

Communication
A large scale software system is composed of small sub-systems, known as servers,
which communicate with each other, i.e. exchange information or data over the net‐
work to solve a business problem, provide business logic, and compose functionality.
Communication can take place in either a synchronous or asynchronous fashion,
depending on the needs and requirements of the system.
Figure 1-1 shows the difference in the action sequence of both synchronous and
asynchronous communication.



Figure 1-1. Sequence Diagram for Synchronous vs Asynchronous Communication

Let’s go over the details of both communication mechanisms in the following sec‐
tions.

We will cover the communication protocols as well as mechanisms


for asynchronous communication in detail in Chapter 6: Commu‐
nication Networks and Protocols.

Synchronous Communication
Consider a phone call conversation with your friend: you hear and speak with them
in real time, using pauses in between to allow each side of the conversation to com‐
plete. This is an example of synchronous communication, a type of communication
in which two or more parties communicate with each other in real-time, with low
latency. This type of communication is more immediate and allows for quicker
resolution of issues or questions.



In system design, a communication mechanism is synchronous when the sender (the
caller) blocks, or waits, for the call or execution to return before continuing. This means
that until a response is returned by the receiver, the application will not execute any
further, which could be perceived by the user as latency or performance lag in the
application.

Asynchronous Communication
To give you an example of asynchronous communication, consider that instead of a
phone call conversation, you switch to email. As you communicate over email with
your friend, you send the message and wait for the reply at a later time (but within
an acceptable time limit). You also follow a practice to follow up again if there is
no response after this time limit has passed. This is an example of asynchronous
communication, a type of communication in which two or more parties do not
communicate with each other in real-time. Asynchronous communication can also
take place through messaging platforms, forums, and social media, where users can
post messages and responses may not be immediate.
In system design, a communication mechanism is asynchronous when the sender
does not block (or wait) for the call or execution to return from the receiver. Execu‐
tion continues on in your program or system, and when the call returns from the
receiving server, a “callback” function is executed. In system design, asynchronous
communication is often used when immediate response is not required, or when the
system needs to be more flexible and tolerant of delays or failures.
In general, the choice of synchronous or asynchronous communication in system
design depends on the specific requirements and constraints of the system. Synchro‐
nous communication is often preferred when real-time response is needed (such as
the communication between the frontend UI and the backend), while asynchronous
communication is often preferred when flexibility and robustness are more important
(such as the communication to check the status of a long running job).
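
To make the distinction concrete, here is a minimal Python sketch (not from the book, and not tied to any particular framework) that contrasts a blocking call with an asynchronous call whose result is delivered to a callback:

import threading
import time

def fetch_order_status(order_id):
    time.sleep(1)                      # stands in for network and processing time
    return f"order {order_id}: shipped"

# Synchronous: the caller blocks until the response comes back.
print(fetch_order_status(42))
print("this line runs only after the response above was received")

# Asynchronous: the caller continues immediately; a callback handles the result.
def on_complete(result):
    print("callback received:", result)

threading.Thread(target=lambda: on_complete(fetch_order_status(43))).start()
print("the caller keeps executing while the request is still in flight")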

Consistency
Consistency, i.e. the requirement of being consistent, or in accordance with a set of
rules or standards, is an important issue when it comes to communication between
servers in a software system. Consistency can refer to a variety of concepts and
contexts in system design.
In the context of distributed systems, consistency can be the property of all replica
nodes (more on this in a moment) or servers having the same view of data at a given
point in time. This means that all replica nodes have the same data, and any updates
to the data are immediately reflected on all replica nodes.



In the context of data storage and retrieval, consistency refers to the property of each
read request returning the value of the most recent write. This means that if a write
operation updates the value of a piece of data, any subsequent read requests for that
data should return the updated value. Let’s discuss each of these in more detail.

Consistency in Distributed Systems


Distributed systems are software systems, which are separated physically but connec‐
ted over the network to achieve common goals using shared computing resources
over the network.
Ensuring consistency, i.e. providing the same view of the data to each server in
distributed systems can be challenging, as multiple replica servers may be located in
different physical locations and may be subject to different failures or delays.
To address these challenges, we can use various techniques in distributed systems to
ensure data consistency, such as:
Data replication
In this approach, multiple copies of the data are maintained on different replica
nodes, and updates to the data are made on all replica nodes simultaneously
through blocking synchronous communication. This ensures that all replica
nodes have the same view of the data at any given time.
Consensus protocols
Consensus protocols are used to ensure that all replica nodes agree on the
updates to be made to the data. They can use a variety of mechanisms, such as
voting or leader election, to ensure that all replica nodes are in agreement before
updating the data.
Conflict resolution
In the event that two or more replica nodes try to update the same data simul‐
taneously, conflict resolution algorithms are used to determine which update
should be applied. These algorithms can use various strategies, such as last writer
wins or merge algorithms, to resolve conflicts.
Overall, ensuring consistency in a distributed system is essential for maintaining the
accuracy and integrity of the data, and various techniques are used to achieve this
goal.

Consistency in Data Storage and Retrieval


Large scale software systems produce and consume a large amount of data and thus,
ensuring consistency in such data storage and retrieval is important for maintaining
the accuracy and integrity of the data in these systems. For example, consider a
database that stores the balance of a bank account. If we withdraw money from the



account, the database should reflect the updated balance immediately. If the database
does not ensure consistency, it is possible for a read request to return an old balance,
which could lead to incorrect financial decisions or even, financial loss for us or our
banks.
To address these challenges, we can use various techniques in data storage systems to
ensure read consistency, such as:
Write-ahead logging
In this technique, writes to the data are first recorded in a log before they are
applied to the actual data. This ensures that if the system crashes or fails, the data
can be restored to a consistent state by replaying the log.
Locking
Locking mechanisms are used to ensure that only one write operation can be
performed at a time. This ensures that multiple writes do not interfere with each
other and that reads always return the value of the most recent write.
Data versioning
In this technique, each write operation is assigned a version number, and reads
always return the value of the most recent version. This allows for multiple writes
to be performed concurrently, while still ensuring that reads return the value of
the most recent write.
Overall, ensuring consistency in data storage and retrieval is essential for maintaining
the accuracy and integrity of the data, and various techniques are used to achieve this
goal.
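
As a concrete illustration of the data versioning technique above, the following is a minimal, hypothetical Python sketch of a versioned cell: every write is stamped with an increasing version number and reads return the value of the most recent version. Production datastores implement far more elaborate variants of this idea.

import threading

class VersionedCell:
    """A single piece of data where every write gets a new version number."""

    def __init__(self, initial_value):
        self._lock = threading.Lock()
        self._versions = [(0, initial_value)]        # list of (version, value)

    def write(self, value):
        with self._lock:
            version = self._versions[-1][0] + 1
            self._versions.append((version, value))
            return version

    def read(self):
        with self._lock:
            return self._versions[-1]                # newest (version, value)

balance = VersionedCell(100)
balance.write(80)                                    # e.g. a withdrawal of 20
print(balance.read())                                # (1, 80): latest version wins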

We will discuss some of the above techniques for ensuring Consis‐


tency in detail in Chapter 2: Storage Types and Relational Stores
and Chapter 3: Non-Relational Stores.

Consistency Spectrum Model


Since consistency can mean different things, the consistency spectrum model helps
us reason about whether a distributed system is working correctly when it is doing
many things concurrently, like reading, writing, and updating data.
The consistency spectrum model represents the various consistency guarantees that
a distributed system can offer, ranging from Eventual Consistency to Strong Con‐
sistency. The specific consistency guarantee chosen depends on the specific require‐
ments and constraints of the system. Let’s walk through the consistency levels in the
consistency spectrum model.



Strong Consistency
At one end of the spectrum, strong consistency guarantees that all replica nodes
have the same view of the data at all times, and that any updates to the data are
immediately reflected on all replica nodes. This ensures that the data is always
accurate and up-to-date, but can be difficult to achieve in practice, as it requires
all replica nodes to be in constant communication with each other.

We will cover the strong consistency requirements of relational


databases as part of ACID property in Chapter 2: Storage Types
and Relational Stores.

Monotonic read consistency


Monotonic read consistency guarantees that once a client has read a value from a
replica node, all subsequent reads from that client will return the same value or a
more recent value. This means that a client will never go back in time and observe an
older value than one it has already read. This provides a stronger consistency guarantee
than eventual consistency, as it ensures that a client's view of the data never moves backwards.
Monotonic write consistency
Monotonic write consistency guarantees that write operations issued by a single
client are applied by the replica nodes in the order in which they were issued. This
means that a later write from a client will never be applied before an earlier write
from that same client. This provides a stronger consistency guarantee than eventual
consistency, as it ensures that writes from a client are not reordered.
Causal consistency
Causal consistency works by categorizing operations into dependent and inde‐
pendent operations. Dependent operations are also called causally-related opera‐
tions. Causal consistency preserves the order of the causally-related operations.
It guarantees that if two operations are causally related, then they will be seen in the same order by all processes
in the system. This means that if operation A must happen before operation B,
then all processes in the system will see A before they see B. This provides a
stronger consistency guarantee than eventual consistency, as it ensures that the
order of related operations is preserved.
Eventual Consistency
At the other end of the spectrum, eventual consistency guarantees that, given
enough time, all replica nodes will eventually have the same view of the data. This
allows for more flexibility and tolerance of delays or failures, but can result in
temporary inconsistencies in the data.



We will cover the tunable consistency feature of some of the
non-relational columnar databases like Cassandra in Chapter 3:
Non-relational Stores.

Figure 1-2 shows the difference in the result of performing the same sequence of actions
under strong consistency and eventual consistency. As you can see on the left for strong
consistency, when x is read from a replica node after being updated from 0 to 2, the
request blocks until replication completes and then returns 2 as the result. On the right
for eventual consistency, querying the replica node before replication completes returns
the stale value of x as 0.

Figure 1-2. Sequence Diagram for Strong Consistency and Eventual Consistency

In general, the consistency spectrum model provides a framework for understanding


the trade-offs between consistency and availability in distributed systems, and helps
system designers choose the appropriate consistency guarantee for their specific
needs.
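
One common way the middle of this spectrum is tuned in practice is with read/write quorums. The sketch below is a simplified, in-memory Python illustration (not any particular database's implementation) of the rule that choosing R + W > N for N replicas forces the read set and write set to overlap, yielding stronger consistency, while smaller quorums trade consistency for latency:

N = 3
replicas = [{"version": 0, "value": 0} for _ in range(N)]

def write(value, version, w):
    # Only the first w replicas are updated here; in a real system the
    # remaining replicas would catch up asynchronously (they stay stale below).
    for replica in replicas[:w]:
        replica["version"], replica["value"] = version, value

def read(r):
    consulted = replicas[-r:]                        # consult any r replicas
    return max(consulted, key=lambda rep: rep["version"])["value"]

write(value=2, version=1, w=2)
print(read(r=2))    # R + W = 4 > N = 3: read overlaps the write set, prints 2
print(read(r=1))    # R + W = 3 is not > N: no overlap guaranteed, prints stale 0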



Availability
In a large-scale software system, subsystems or servers can go down and fail to respond
to client requests; the proportion of time for which the system is up and able to serve
requests is referred to as the system's availability. A system that is highly available is able
to process requests and return responses in a timely manner, even under heavy load or
in the face of failures or errors. Let's try to quantify the measurement of availability of
the system.

Measuring Availability
Availability can be measured mathematically as the percentage of the time the system
was up (total time - time system was down) over the total time the system should
have been running.

Availability % = ((Total Time − Total Downtime) / Total Time) × 100

Availability percentages are represented in 9s, based on the above formula over a
period of time. You can see the breakdown of what these numbers really work out to
in Table 1-1.

Table 1-1. Availability Percentages Represented in 9s


Availability % Downtime per Year Downtime per Month Downtime per Week
90% (1 nine) 36.5 days 72 hours 16.8 hours
99% (2 nines) 3.65 days 7.2 hours 1.68 hours
99.5% (2 nines) 1.83 days 3.60 hours 50.4 minutes
99.9% (3 nines) 8.76 hours 43.8 minutes 10.1 minutes
99.99% (4 nines) 52.56 minutes 4.32 minutes 1.01 minutes
99.999% (5 nines) 5.26 minutes 25.9 seconds 6.05 seconds
99.9999% (6 nines) 31.5 seconds 2.59 seconds 0.605 seconds
99.99999% (7 nines) 3.15 seconds 0.259 seconds 0.0605 seconds

The goal for availability is usually to achieve the highest level possible, such as “five
nines” (99.999%) or even “six nines” (99.9999%). However, the level of availability
that is considered realistic or achievable depends on several factors, including the
complexity of the system, the resources available for maintenance and redundancy,
and the specific requirements of the application or service.
Achieving higher levels of availability becomes progressively more challenging
and resource-intensive. Each additional nine requires an exponential increase in
redundancy, fault-tolerant architecture, and rigorous maintenance practices. It often
involves implementing redundant components, backup systems, load balancing, fail‐
over mechanisms, and continuous monitoring to minimize downtime and ensure
rapid recovery in case of failures.



While some critical systems, such as financial trading platforms or emergency serv‐
ices, may strive for the highest levels of availability, achieving and maintaining them
can be extremely difficult and costly. In contrast, for less critical applications or
services, a lower level of availability, such as 99% or 99.9%, may be more realistic and
achievable within reasonable resource constraints.
Ultimately, the determination of what level of availability is realistic and achievable
depends on a careful evaluation of the specific requirements, resources, costs, and
trade-offs involved in each particular case.

Availability in parallel vs in sequence


The availability of a system that consists of multiple sub-systems depends on whether
the components are arranged in sequence or in parallel with respect to serving the
request.
Figure 1-3 shows the arrangement of components in sequential system on the left,
where the request needs to be served from each component in sequence vs the
parallel system on the right, where the request can be served from either component
in parallel.

Figure 1-3. Sequential system vs Parallel system

If the components are in sequence, the overall availability of the service will be the
product of the availability of each component. For example, if two components with
99.9% availability are in sequence, their total availability will be 99.8%.

On the other hand, if the components are in parallel, the overall availability of the
service will be 1 minus the product of the unavailability of each component. For
example, if two components with 99.9% availability are in parallel, their total availability
will be 99.9999%. This can lead to significantly higher availability compared to the same
components arranged in sequence (roughly six nines versus three nines in the above
examples).
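
The arithmetic can be captured in a few lines of Python; the sketch below (with illustrative numbers) computes both arrangements from per-component availabilities:

def availability_in_sequence(availabilities):
    """Every component must be up, so multiply the availabilities."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def availability_in_parallel(availabilities):
    """Only one component must be up: 1 minus the product of unavailabilities."""
    combined_unavailability = 1.0
    for a in availabilities:
        combined_unavailability *= (1.0 - a)
    return 1.0 - combined_unavailability

print(availability_in_sequence([0.999, 0.999]))   # 0.998001 -> about 99.8%
print(availability_in_parallel([0.999, 0.999]))   # 0.999999 -> 99.9999%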

Overall, the arrangement of components in a service can have a significant impact on


its overall availability, and it is important to consider this when designing a system for
high availability.

Ensuring Availability
Ensuring availability in a system is important for maintaining the performance and
reliability of the system. There are several ways to increase the availability of a system,
including:
Redundancy
By having multiple copies of critical components or subsystems, a system can
continue to function even if one component fails. This can be achieved through
the use of redundant load balancers, failover systems, or replicated data stores.
Fault tolerance
By designing systems to be resistant to failures or errors, the system can continue
to function even in the face of unexpected events. This can be achieved through
the use of error-handling mechanisms, redundant hardware, or self-healing sys‐
tems.
Load balancing
By distributing incoming requests among multiple servers or components, a
system can more effectively handle heavy load and maintain high availability.
This can be achieved through the use of multiple load balancers or distributed
systems.

We will cover load balancing in detail in Chapter 5: Scaling


Approaches and Mechanisms and the different types of AWS load
balancers in Chapter 9: AWS Network Services.



Overall, ensuring availability in a system is important for maintaining the perfor‐
mance and reliability of the system, and various techniques can be used to increase
availability.

Availability Patterns
To ensure availability, there are two major complementary patterns to support high
availability: fail-over and replication pattern.

Failover Patterns. Failover refers to the process of switching to a redundant or backup


system in the event of a failure or error in the primary system. The failover pattern
chosen depends on the specific requirements and constraints of the system, including
the desired level of availability and the cost of implementing the failover solution.
There are two main types of failover patterns: active-active and active-passive.
Active-active failover
In an active-active failover pattern as shown on the left in Figure 1-4, multiple
systems are used in parallel, and all systems are actively processing requests. If
one system fails, the remaining systems can continue to process requests and
maintain high availability. This approach allows for more flexibility and better
utilization of resources, but can be more complex to implement and maintain.
Active-passive failover
In an active-passive failover pattern as shown on the right in Figure 1-4, one
system is designated as the primary system and actively processes requests, while
one or more backup systems are maintained in a passive state. If the primary
system fails, the backup system is activated to take over processing of requests.
This approach is simpler to implement and maintain, but can result in reduced
availability if the primary system fails, as there is a delay in switching to the
backup system.



Figure 1-4. Active Active Failover system setup vs Active Passive Failover system setup

The failover pattern can involve the use of additional hardware and can add com‐
plexity to the system. There is also the potential for data loss if the active system
fails before newly written data can be replicated to the passive system. Overall, the
choice of failover pattern depends on the specific requirements and constraints of the
system, including the desired level of availability and the cost of implementing the
failover solution.
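
To illustrate the active-passive idea, here is a minimal Python sketch of health-check-driven failover; the hostnames and health endpoints are hypothetical:

import urllib.request

PRIMARY_HEALTH = "http://primary.internal/health"   # hypothetical endpoints
STANDBY_HEALTH = "http://standby.internal/health"

def is_healthy(url, timeout_seconds=2):
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return response.status == 200
    except OSError:
        return False                                 # unreachable counts as down

def pick_target():
    if is_healthy(PRIMARY_HEALTH):
        return PRIMARY_HEALTH
    return STANDBY_HEALTH                            # fail over to the passive node

print("routing traffic to:", pick_target())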

These failover patterns are employed in relational datastores, non-


relational data stores and caches, and load balancers, which we will
cover in detail in Chapter 2: Storage Types and Relational Stores,
Chapter 3: Non-relational Stores, Chapter 4: Caching Policies and
Strategies and Chapter 5: Scaling Approaches and Mechanisms
respectively.

Replication Patterns. Replication is the process of maintaining multiple copies of data


or other resources in order to improve availability and fault tolerance. The replication
pattern chosen depends on the specific requirements and constraints of the system,
including the desired level of availability and the cost of implementing the replication
solution.
There are two main types of replication patterns: Multi leader and Single leader.
Multi leader replication
In a multi leader replication pattern as shown on the left in Figure 1-5, multiple
systems are used in parallel and all systems are able to both read and write



data. This allows for more flexibility and better utilization of resources, as all
systems can process requests and updates to the data simultaneously. A load
balancer is required or application logic changes need to be made to support
multiple leaders and identify on which leader node to write. Most multi leader
systems are either loosely consistent or have increased write latency due to
synchronization to remain consistent. Conflict resolution comes more into play
as more write nodes are added, leading to increased latency. However, this
approach can become more complex to implement and maintain, as it requires
careful management of conflicts and errors.
Single leader replication
In a single leader replication pattern as shown on the right in Figure 1-5, one
system is designated as the leader system and is responsible for both reading
and writing data, while one or more follower systems are used to replicate the
data. The follower systems can only be used for reading data, and updates to
the data must be made on the leader system. Additional logic is required to
be implemented to promote a follower to the leader. This approach is simpler
to implement and maintain, but can result in reduced availability if the leader
system fails, as updates to the data can only be made on the leader system and
there is a risk of losing the data updates.

Figure 1-5. Multi-leader replication system setup vs Single leader replication system
setup

There is a risk of data loss if the leader system fails before newly written data can be
replicated to other nodes. And thus, the more read replicas that are used, the more
writes need to be replicated, which can lead to greater replication lag. In addition, the
use of read replicas can impact the performance of the system, as they may be bogged
down with replaying writes and unable to process as many reads. Furthermore,



replication can involve the use of additional hardware and can add complexity to
the system. Finally, some systems may have more efficient write performance on
the leader system, as it can spawn multiple threads to write in parallel, while read
replicas may only support writing sequentially with a single thread. Overall, the
choice of replication pattern depends on the specific requirements and constraints of
the system, including the desired level of availability and the cost of implementing the
replication solution.

We will cover how relational and non-relational datastores ensure


availability using single leader and multi-leader replication in
Chapter 2: Storage Types and Relational Stores and Chapter 3:
Non-relational Stores. Do look out for leaderless replication using
consistent hashing to ensure availability in non-relational stores
like key-value stores and columnar stores in detail in Chapter 3:
Non-relational Stores.

Reliability
In system design, reliability refers to the ability of a system or component to perform
its intended function consistently and without failure over a given period of time.
It is a measure of the dependability or trustworthiness of the system. Reliability is
typically expressed as a probability or percentage of time that the system will operate
without failure. For example, a system with a
reliability of 99% will fail only 1% of the time. Let’s try to quantify the measurement
of reliability of the system.

Measuring Reliability
One way to measure the reliability of a system is through the use of mean time
between failures(MTBF) and mean time to repair (MTTR).
Mean time between failures
Mean time between failures (MTBF) is a measure of the average amount of time
that a system can operate without experiencing a failure. It is typically expressed
in hours or other units of time. The higher the MTBF, the more reliable the
system is considered to be.

MTBF = (Total Elapsed Time − Total Downtime) / Total Number of Failures

Mean time to repair


Mean time to repair (MTTR) is a measure of the average amount of time it takes
to repair a failure in the system. It is also typically expressed in hours or other



units of time. The lower the MTTR, the more quickly the system can be restored
to operation after a failure.

MTTR = Total Maintenance Time / Total Number of Repairs

Together, MTBF and MTTR can be used to understand the overall reliability of a
system. For example, a system with a high MTBF and a low MTTR is considered to
be more reliable than a system with a low MTBF and a high MTTR, as it is less likely
to experience failures and can be restored to operation more quickly when failures do
occur.
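
A small worked example (with hypothetical numbers, treating each outage's downtime as its repair time) shows how the two metrics and availability fall out of the same incident log:

total_elapsed_hours = 30 * 24            # a 30-day observation window
outage_durations_hours = [0.5, 1.5, 1.0] # three failures and their downtime

total_downtime = sum(outage_durations_hours)
failures = len(outage_durations_hours)

mtbf = (total_elapsed_hours - total_downtime) / failures   # 239.0 hours
mttr = total_downtime / failures                           # 1.0 hour
availability = (total_elapsed_hours - total_downtime) / total_elapsed_hours

print(mtbf, mttr, round(availability * 100, 3))            # 239.0 1.0 99.583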

Reliability and Availability


It is important to note that reliability and availability are not mutually exclusive.
A system can be both reliable and available, or it can be neither. A system that is
reliable but not available is not particularly useful, as it may be reliable but not able to
perform its intended function when needed.
On the other hand, a system that is available but not reliable is also not useful, as it
may be able to perform its intended function when needed, but it may not do so con‐
sistently or without failure. In order to achieve high reliability and availability to meet
agreed service level objectives (SLO), it is important to design and maintain systems
with redundant components and robust failover mechanisms. It is also important to
regularly perform maintenance and testing to ensure that the system is operating at
its optimal level of performance.
In general, the reliability of a system is an important consideration in system design,
as it can impact the performance and availability of the system over time.

Service level objectives and goals including Change Management,


Problem Management and Service Request Management of AWS
Managed Services, which will be introduced in section 2 - Diving
Deep into AWS Services are provided in AWS documentation here.

Scalability
In system design, we need to ensure that the performance of the system increases
with the resources added to handle an increasing workload, which can be either
request workload or data storage workload. This is referred to as scalability in system
design, which requires the system to respond to increased demand and load. For
example, a social network needs to scale with the increasing number of users as well
as the content feed on the platform, which it indexes and serves.



Scalability Patterns
To ensure scalability, there are two major complementary patterns to scale the system:
vertical scaling and horizontal scaling.

Vertical Scaling. Vertical scaling involves meeting the load requirements of the system
by increasing the capacity of a single server by upgrading it with more resources
(CPU, RAM, GPU, storage etc) as shown on the left in Figure 1-6. Vertical scaling is
useful when dealing with predictable traffic, as it allows for more resources to be used
to handle the existing demand. However, there are limits to how much a single
server can scale up based on its current configuration, and the cost of scaling
up is generally high, since adding higher-end resources to an existing server
commands a premium.

Horizontal Scaling. Horizontal scaling involves meeting the load requirements of the
system by adding more commodity servers to serve the requests, as shown on the right
in Figure 1-6. Horizontal scaling is useful when dealing with unpredictable traffic, as
adding more servers increases the capacity to handle more requests, and if demand rises
further, more servers can be added to the pool cost-effectively. However, though
horizontal scaling provides a better dollar-cost proposition for scaling, the catch is the
complexity of managing multiple servers and ensuring they work collectively as if they
were a single abstracted server handling the workload.

Figure 1-6. Vertical Scaling vs Horizontal Scaling



We will cover both scaling approaches and mechanisms in detail in
Chapter 5: Scaling Approaches and Mechanisms.

In early stage systems, you can start by scaling vertically, upgrading the existing server's
configuration, and later, when you hit the limits of scaling up, move to scaling the
system horizontally.
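
The essence of horizontal scaling is making a pool of servers behave like one; a minimal round-robin routing sketch (with hypothetical hostnames) illustrates the idea, which Chapter 5 develops further with real load balancers:

import itertools

servers = ["app-1.internal", "app-2.internal", "app-3.internal"]

def make_router(pool):
    rotation = itertools.cycle(pool)                 # round-robin over the pool
    return lambda request_id: f"request {request_id} -> {next(rotation)}"

route = make_router(servers)
for request_id in range(5):
    print(route(request_id))

# Scaling out is simply adding another commodity server to the pool:
route = make_router(servers + ["app-4.internal"])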

Maintainability
In system design, maintainability is the ability of the system to be modified, adapted,
or extended to meet the changing needs of its users while ensuring smooth system
operations. In order for a software system to be maintainable, it must be designed to
be flexible and easy to modify or extend.
The maintainability of a system requires covering these three underlying aspects of
the system:
Operability
This requires the system to operate smoothly under normal conditions and to
return to normal operation within a stipulated time after a fault. When a
system is maintainable in terms of operability, it reduces the time and effort
required to keep the system running smoothly. This is important because effi‐
cient operations and management contribute to overall system stability, reliabil‐
ity, and availability.
Lucidity
This requires the system to be simple and lucid to understand, extend to add
features and even, fix bugs. When a system is lucid, it enables efficient collab‐
oration among team members, simplifies debugging and maintenance tasks,
and facilitates knowledge transfer. It also reduces the risk of introducing errors
during modifications or updates.
Modifiability
This requires the system to be built in a modular way to allow it to be modified
and extended easily, without disrupting the functionality of other subsystems.
Modifiability is vital because software systems need to evolve over time to adapt
to new business needs, technological advancements, or user feedback. A system
that lacks modifiability can become stagnant, resistant to change, and difficult to
enhance or adapt to future demands.
By prioritizing maintainability, organizations can reduce downtime, lower mainte‐
nance costs, enhance productivity, and increase the longevity and value of their
software systems.



Fault Tolerance
Large scale systems generally employ a large number of servers and storage devices to
handle and respond to the user requests and store data. Fault tolerance requires the
system to recover from any failure (either hardware or software failure) and continue
to serve the requests. This requires avoiding single points of failure in the large
system and the ability to reroute requests to the functioning sub-systems to complete
the workload.
Fault tolerance needs to be supported at hardware as well as software levels, while
ensuring the data safety, i.e. making sure we don’t lose the data. There are two major
mechanisms to ensure data safety: replication and checkpointing.

Replication
Replication based fault tolerance ensures data safety as well as serving the request by
replicating both the service through multiple replica servers and also, replicating the
data through multiple copies of data across multiple storage servers. During a failure,
the failed node gets swapped with a fully functioning replica node. Similarly data is
also served again from a replica store, in case the data store has failed. The replication
patterns were already covered in the previous section in availability.

Checkpointing
Checkpointing based fault tolerance ensures that data is reliably stored and backed
up, even after the initial processing is completed. It allows for a system to recover
from any potential data loss, as it can restore a previous system state and prevent
data loss. Checkpointing is commonly used to ensure system and data integrity,
especially when dealing with large datasets. It can also be used to verify that data
is not corrupted or missing, as it can quickly detect any changes or issues in the
data and then take corrective measures. Checkpointing is an important tool for data
integrity, reliability, and security, as it ensures that all data is stored properly and
securely.
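
As a minimal illustration of checkpointing, the sketch below (hypothetical file name and batch size) periodically persists progress so that, after a crash, processing resumes from the last checkpoint rather than from the beginning:

import json
import os

CHECKPOINT_FILE = "checkpoint.json"                  # illustrative file name

def load_checkpoint():
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)["last_processed"]
    return 0                                         # no checkpoint: start over

def save_checkpoint(last_processed):
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"last_processed": last_processed}, f)

start = load_checkpoint()                            # resume after a crash
for record_id in range(start, 1000):
    # ... process record_id here ...
    if record_id % 100 == 0:                         # checkpoint every 100 records
        save_checkpoint(record_id)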

The recovery manager of a database uses checkpointing to ensure the


durability and reliability of the database in the event of failures or
crashes. This will be covered in detail in Chapter 2: Storage Types
and Relational Stores.

There are two checkpointing patterns — it can be done in either using synchronous
or asynchronous mechanisms.



Synchronous checkpointing
Synchronous checkpointing in a system is achieved by stopping all the data
mutation requests and allowing only read requests, while waiting for all the
checkpointing process to complete for the current data mutation to ensure its
integrity across all nodes. This always ensures consistent data state across all the
nodes.
Asynchronous checkpointing
Asynchronous checkpointing in a system is done by checkpointing asynchro‐
nously on all the nodes, while continuing to serve all the requests (including data
mutation requests) without waiting for the acknowledgement of the checkpoint‐
ing process to complete. This mechanism suffers from the possibility of having
inconsistent data state across the servers.
We have now covered the basic concepts and requirements of a large scale system
as we strive towards building a performant and scalable system, one that is also
highly available, reliable, maintainable and fault-tolerant. Before diving deep into
how to build such systems, let’s go through the inherent fallacies as well as trade-offs
in designing such systems.

Fallacies of Distributed Computing


As a large-scale software system involves multiple distributed systems, it is often
subject to certain fallacies that can lead to incorrect assumptions and incorrect imple‐
mentations. These fallacies were first introduced by L. Peter Deutsch and they cover
the common false assumptions that software developers make while implementing
distributed systems. These eight fallacies are:
Reliable Network
The first fallacy is assuming that “The network is reliable”. Networks are complex,
dynamic and often, unpredictable. Small issues like switch or power failures
can even bring the entire network of a data-center down, making the network
unreliable. Thus, it is important to account for the potential of an unreliable
network while designing large scale systems, ensuring network fault tolerance
from the start. Given that networks are inherently unreliable, to build reliable
services on top we must rely on protocols that can cope with network outages
and packet loss.
Zero Latency
The second fallacy is assuming that “Latency is zero”. Latency is an inherent
limitation of networks, constrained by the speed of light, i.e. even in perfect
theoretical systems, the data can’t reach faster than the speed of light between
the nodes. Hence, to account for latency, the system should be designed to bring
the clients close to data through edge-computing and even choosing the servers



in the right geographic data centers closer to the clients and routing the traffic
wisely.
Infinite Bandwidth
The third fallacy is assuming that “Bandwidth is infinite”. When a high volume of
data is flowing through the network, there is always network resource contention
leading to queueing delays, bottlenecks, packet drops and network congestion.
Hence, to account for finite bandwidth, build the system using lightweight data
formats for the data in transit to preserve the network bandwidth and avoid
network congestion. Or use multiplexing, a technique that improves bandwidth
utilization by combining data from several sources and sending it over the same
communication channel/medium.
Secure Network
The fourth fallacy is assuming that “The network is secure”. Assuming a network
is always secure, when there are multiple ways a network can be compromised
(ranging from software bugs, OS vulnerabilities, viruses and malwares, cross-site
scripting, unencrypted communication, malicious middle actors etc) can lead to
system compromise and failure. Hence to account for insecure networks, build
systems with a security first mindset and perform defense testing and threat
modelling of the built system.
Fixed Topology
The fifth fallacy is assuming that “Topology doesn’t change”. In distributed sys‐
tems, the topology changes continuously, because of node failures or node addi‐
tions. Building a system that assumes fixed topology will lead to system issues
and failures due to latency and bandwidth constraints. Hence, the underlying
topology must be abstracted out and the system must be built oblivious to the
underlying topology and tolerant to its changes.
Single Administrator
The sixth fallacy is assuming that “There is one administrator”. This can be a fair
assumption in a very small system like a personal project, but this assumption
breaks down in large scale distributed computing, where multiple systems have
separate OS, separate teams working on it and hence, multiple administrators.
To account for this, the system should be built in a decoupled manner, ensuring
repair and troubleshooting becomes easy and distributed too.
Zero Transport cost
The seventh fallacy is assuming that “Transport cost is zero”. Network infrastruc‐
ture has costs, including the cost of network servers, switches, routers, and other
hardware, the software that operates this hardware, and the team cost to keep it
running smoothly. Thus, the assumption that transporting data from one node to



another is negligible is false and must consequently be noted in budgets to avoid
vast shortfalls.
Homogenous Network
The eighth fallacy is assuming that “The network is homogeneous”. A network is
built with multiple devices with different configurations and using multiple pro‐
tocols at different layers and therefore we can’t assume a network is homogenous.
Taking into consideration the heterogeneity of the network as well as focusing
on the interoperability of the system (i.e. ensuring subsystems can communicate
and work together despite having such differences) will help to avoid this pitfall.

The AWS Well-Architected Framework consists of six core pillars


that provide guidance and best practices for designing and building
systems on the AWS cloud, avoiding these fallacies and pitfalls.
Operational Excellence pillar avoids the fallacy of Single Adminis‐
trator and Homogenous Network. Security pillar avoids the fallacy
of Secure Network. Reliability pillar avoids the fallacy of Reliable
Network and Fixed Topology. Performance Efficiency pillar avoids
the fallacy of Zero Latency and Infinite Bandwidth. Cost Optimiza‐
tion pillar as well as Sustainability pillar avoids the fallacy of Zero
Transport Cost.
The book will not cover the AWS Well-Architected Framework in
detail and it is left as an excursion for the readers.

These fallacies cover the basic assumptions we should avoid while building large scale
systems. Overall, neglecting the fallacies of distributed computing can lead to a range
of issues, including system failures, performance bottlenecks, data inconsistencies,
security vulnerabilities, scalability challenges, and increased system administration
complexity. It is important to acknowledge and account for these fallacies during
the design and implementation of distributed systems to ensure their robustness,
reliability, and effective operation in real-world environments. Let's also go through
the trade-offs in the next section, which are generally encountered in designing large
scale software systems.

System Design Trade-offs


System design involves making a number of trade-offs that can have a significant
impact on the performance and usability of a system. When designing a system, you
must consider factors like cost, scalability, reliability, maintainability, and robustness.
These factors must be balanced to create a system that is optimized for the particular
needs of the user. Ultimately, the goal is to create a system that meets the needs of the
user without sacrificing any of these important factors.



For example, if a system needs to be highly reliable but also have scalability, then you
need to consider the trade-offs between cost and robustness. A system with a high
level of reliability may require more expensive components, but these components
may also be robust and allow for scalability in the future. On the other hand, if cost is
a priority, then you may have to sacrifice robustness or scalability in order to keep the
system within budget.
In addition to cost and scalability, other trade-offs must be taken into account
when designing a system. Performance, security, maintainability, and usability are
all important considerations that must be weighed when designing a system. There
are some theoretical tradeoffs that arise in system design, which we will discuss in
this section.

Time vs Space
Space time trade-offs or time memory trade-offs arise inherently in implementation
of the algorithms in computer science for the workload, even in distributed systems.
This trade-off is necessary because system designers need to consider the time limita‐
tions of the algorithms and sometimes use extra memory or storage to make sure
everything works optimally. One example of such a trade-off is using look-up
tables in memory or data storage instead of recalculating values on every request,
serving more requests by simply looking up pre-calculated values.
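
A minimal sketch of this trade-off is memoization: spend memory on a look-up table of previously computed results so repeated requests skip the recalculation. The function and its cost model below are purely illustrative:

from functools import lru_cache

@lru_cache(maxsize=None)          # spend memory on a table of past results...
def shipping_cost(origin: str, destination: str) -> float:
    # ...instead of repeating an expensive calculation (or remote call) each time
    return sum(ord(c) for c in origin + destination) * 0.01

shipping_cost("SFO", "JFK")       # computed once
shipping_cost("SFO", "JFK")       # served from the in-memory look-up table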

Latency vs Throughput
Another trade-off that arises inherently in system design is latency vs throughput.
Before diving into the trade-off, let’s make sure you understand these concepts
thoroughly.
Latency, Processing time and Response time
Latency is the time that a request waits to be handled. Until the request is picked
up to be handled, it is latent, inactive, queued or dormant.
Processing time, on the other hand, is the time taken by the system to process the
request, once it is picked up.
Hence, the overall response time is the duration between the request that was
sent and the corresponding response that was received, accounting for network
and server latencies.
Mathematically, it can be represented by the following formula:

Response Time = Latency + Processing Time



Throughput and Bandwidth
Throughput and bandwidth are metrics of network data capacity, and are used
to account for network scalability and load. Bandwidth refers to the maximum
amount of data that could, theoretically, travel from one point in the network
to another in a given time. Throughput refers to the actual amount of data trans‐
mitted and processed throughout the network. Thus, bandwidth describes the
theoretical limit, while throughput provides the empirical metric. The throughput is
always lower than the bandwidth unless the network is operating at its maximum
efficiency.
Bandwidth is a limited resource as each network device can only handle and
process limited capacity of data before passing it to the next network device, and
some devices consume more bandwidth than others. Insufficient bandwidth can
lead to network congestion, which slows connectivity.
Since latency measures how long packets take to reach the destination in a network,
while throughput measures how many packets are processed within a specified period
of time, they have an inverse relationship: the higher the latency, the more packets get
queued up in the network, reducing the number of packets being processed and
leading to lower throughput.
Since the system is being gauged for lower latency under high throughput or load,
the metric to capture latency is through percentiles like p50, p90, p99 and so on. For
example, the p90 latency is the highest latency value (slowest response) of the fastest
90 percent of requests. In other words, 90 percent of requests have responses that are
equal to or faster than the p90 latency value. Note that average latency of a workload
is not used as a metric, as averages as point estimates are susceptible to outliers.
Because of the latency vs throughput trade-off, latency will go up as the load on the
system is increased for higher throughput. Hence, systems should be designed with an
aim for maximal throughput within acceptable latency.
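
For illustration, here is a minimal sketch (nearest-rank method, with made-up samples) of how p50/p90/p99 latencies are reported from observed request latencies:

import math

def percentile(latencies_ms, p):
    """Nearest-rank percentile: the value at or below which p percent of samples fall."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

samples_ms = [12, 15, 11, 90, 14, 13, 250, 16, 12, 18]   # made-up request latencies
for p in (50, 90, 99):
    print(f"p{p} = {percentile(samples_ms, p)} ms")       # p50=14, p90=90, p99=250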

Performance vs Scalability
As discussed earlier in the chapter, scalability is the ability of the system to respond to
increased demand and load. On the other hand, performance is how fast the system
responds to a single request. A service is scalable if it results in increased performance
in a manner proportional to resources added. When a system has performance
problems, it is slow even for a single user (p50 latency = 100ms), whereas when the
system has scalability problems, it may be fast under light load (p50 latency = 1ms at
100 requests) but slow for users under heavy load (p50 latency = 100ms at 100k
requests).



Consistency vs Availability
As discussed earlier in the chapter, strong consistency in data storage and retrieval is
the guarantee that every read receives the most recent write, while high availability is
the requirement of the system to always provide a non-error response to the request.
In a distributed system where the network fails (i.e. packets get dropped or delayed
due to the fallacies of distributed computing leading to partitions) there emerges an
inherent trade-off between strong consistency and high availability. This trade-off is
called the CAP theorem, also known as Brewer’s theorem.

CAP Theorem
The CAP theorem, as shown in Venn Diagram Figure 1-7, states that it is impossible
for a distributed system to simultaneously provide all three of the following guaran‐
tees: consistency (C), availability (A), and partition tolerance (P). According to the
theorem, a distributed system can provide at most two of these guarantees at any
given time. Systems need to be designed to handle network partitions as networks
aren’t reliable and hence, partition tolerance needs to be built in. So, in particular, the
CAP theorem implies that in the presence of a network partition, one has to choose
between consistency and availability.



Figure 1-7. CAP Theorem Venn Diagram

However, CAP is frequently misunderstood as if one has to choose to abandon one of


the three guarantees at all times. In fact, the choice is really between consistency and
availability only when a network partition or failure happens; at all other times, the
trade-off has to be made based on the PACELC theorem.

PACELC Theorem
The PACELC theorem, shown as a decision flowchart in Figure 1-8, is a more nuanced
version of the CAP theorem, which states that in the case of network partitioning (P)
in a distributed computer system, one has to choose between availability (A) and
consistency (C) (as per the CAP theorem), but else (E), even when the system is
running normally in the absence of partitions, one has to choose between latency
(L) and consistency (C). This trade-off arises naturally because to handle network
partitions, data and services are replicated in large scale systems, leading to the choice
between the consistency spectrum and the corresponding latency.
If the system tries to provide for strong consistency at the one end of the consis‐
tency spectrum model, it has to do replication with synchronous communication



and blocking to ensure all the read replicas receive the most recent write, waiting
on the acknowledgement from all the replica nodes, adding to high latency. On
the other hand, if the system does asynchronous replication without waiting for
acknowledgment from all nodes, it will end up providing eventual consistency (i.e.
a read request will reflect the most recent write only once the replica node serving it
has applied that data mutation).

Figure 1-8. PACELC Theorem Decision Flowchart

In summary, the CAP and PACELC theorems are important concepts in distributed
systems design that provide a framework for understanding the trade-offs involved in
designing highly available and strongly consistent systems.

We will cover how non-relational stores navigate the CAP theorem
trade-off by providing the BASE property in detail in Chapter 3: Non-
relational Stores.

Given such requirements, fallacies and trade-offs in system design, in order to avoid
repeating the mistakes of the past we should subscribe to a set of guidelines learned by
previous generations of software design practitioners. Let’s dig into those now.



System Design Guidelines
System design will always present the most interesting and challenging trade-offs,
and a system designer should be aware of the hidden costs and be well-equipped
to get it right — though not perfect! These guidelines, which have emerged over
years of practicing system design, guide us to avoid pitfalls and handle trade-offs
while designing large scale-systems. These guidelines aren’t just vague generalities but
virtues that help reflect on why the system was designed the way it was, why it was
implemented the way it is and why that was the right thing to do.

Guideline of Isolation: Build It Modularly


Controlling complexity is the essence of computer programming.
—Brian Kernighan

The first guideline is to build the system modularly, i.e. break down a complex system
into smaller, independent components or modules that can function independently,
yet also work together to form the larger system. Building it modularly helps in
improving all the requirements of the large scale system:
Maintainability
Modules can be updated or replaced individually without affecting the rest of the
system.
Reusability
Modules can be reused in different systems or projects, reducing the amount of
new development required.
Scalability
Modules can be added or removed and even scaled independently as needed to
accommodate changes in requirements or to support growth.
Reliability
Modules can be tested and validated independently, reducing the risk of system-
wide failures.
Modular systems can be implemented in a variety of ways, including through the
use of microservices architecture, component-based development, and modular pro‐
gramming, which we will cover in more detail in chapter 7. However, designing
modular systems can be challenging, as it requires careful consideration of the inter‐
faces between modules, data sharing and flow, and dependencies.

Guideline of Simplicity: Keep it Simple, Silly


Everything should be made as simple as possible, but no simpler
—Eric S. Raymond

The second guideline is to keep the design simple by avoiding complex and unneces‐
sary features and avoiding over-engineering. To build simple systems using the KISS
(Keep it Simple, Silly) guideline, designers can follow these steps:

1. Identify the core requirements: Determine the essential features and functions
the system must have, and prioritize them.
2. Minimize the number of components: Reduce the number of components or
modules in the system, making sure each component serves a specific purpose.
3. Avoid over-engineering: Don’t add unnecessary complexity to the system, such
as adding features that are not necessary for its functioning.
4. Make the system easy to use: Ensure the system is intuitive and straightforward
for users to use and understand.
5. Test and refine: Test the system to ensure it works as intended and make
changes to simplify the system if necessary.

By following the KISS guideline, you as a system designer can build simple, efficient,
and effective systems that are easy to maintain and less prone to failure.

Guideline of Performance: Metrics Don’t Lie


Performance problems cannot be solved only through the use of Zen meditation.
—Jeffrey C. Mogul

The third guideline is to measure then build, and rely on the metrics as you can’t
cheat the performance and scalability. Metrics and observability are crucial for the
operation and management of large scale systems. These concepts are important for
understanding the behavior and performance of large-scale systems and for identify‐
ing potential issues before they become problems.
Metrics
Metrics are quantitative measures that are used to assess the performance of a
system. They provide a way to track key performance indicators, such as resource
utilization, response times, and error rates, and to identify trends and patterns
in system behavior. By monitoring metrics, engineers can detect performance
bottlenecks and anomalies, and take corrective actions to improve the overall
performance and reliability of the system.
Observability
Observability refers to the degree to which the state of a system can be inferred
from its externally visible outputs. This includes being able to monitor system
health and diagnose problems in real-time. Observability is important in large
scale systems because it provides a way to monitor the behavior of complex
systems and detect issues that may be impacting their performance.
Together, metrics and observability provide the information needed to make
informed decisions about the operation and management of large scale systems. By
ensuring that systems are properly instrumented with metrics and that observability
is designed into the system, you can detect and resolve issues more quickly, prevent
outages, and improve the overall performance and reliability of the system.

Guideline of Tradeoffs: There Is No Such Thing As A Free Lunch


Get it right. Neither abstraction nor simplicity is a substitute for getting it right.
—Butler Lampson

The fourth guideline is “there is no such thing as a free lunch” (TINSTAAFL),
pointing to the realization that all decisions come with trade-offs and that optimizing
for one aspect of a system often comes at the expense of others. In system design,
there are always trade-offs and compromises that must be made when designing a
system. For example, choosing a highly optimized solution for a specific problem
might result in reduced maintainability or increased complexity. Conversely, opting
for a simple solution might result in lower performance or increased latency.
This guideline TINSTAAFL highlights the need for careful consideration and balanc‐
ing of competing factors in system design, such as performance, scalability, reliability,
maintainability, and cost. Designers must weigh the trade-offs between these factors
and make informed decisions that meet the specific requirements and constraints of
each project.
Ultimately, you need to realize that there is no single solution that is optimal in all
situations and that as system designers, you must carefully consider the trade-offs
and implications of your decisions to build the right system.

Guideline of Use Cases: It Always Depends


Not everything worth doing is worth doing well.
—Tom West

The fifth guideline is that design always depends, as system design is a complex and
multifaceted process that is influenced by a variety of factors, including requirements,
user needs, technological constraints, cost, scalability, maintenance and even, regula‐
tions. By considering these and other factors, you can develop systems that meet the
needs of the users, are feasible to implement, and are sustainable over time. Since
there are many ways to design a system to solve a common problem, it indicates a
stronger underlying truth: there is no one “best” way to design the system, i.e. there is
no silver bullet. Thus, we settle for something reasonable and hope it is good enough.



Conclusion
This chapter has given you a solid introduction to how we design software systems.
We’ve talked about the main concepts behind system design, the trade-offs we have to
make when dealing with big software systems, the fallacies to avoid in building such
large scale systems and the smart guidelines to avoid such fallacies.
Think of system design like balancing on a seesaw – you have to find the right
balance between different trade-offs. As an overall guideline, system design is always
a trade-off between competing factors, and you as a system designer must carefully
balance these factors to create systems that are effective and efficient.
Now, in the next set of chapters (Part I), we’re going to explore the basic building
blocks of systems. We’ll talk about important topics, like where we store data, how we
speed things up with caches, how we distribute work with load balancers, and how
different parts of a system talk to each other through networking and orchestration.
Once you’ve got a handle on those basics, we’ll dive into the world of AWS systems in
Part II. This will help you understand how to make big and powerful systems using
Amazon Web Services. And all this knowledge will come together in Part III, where
we’ll put everything into practice and learn how to design specific use-cases and build
large-scale systems on AWS.
But before we get there, our next stop is exploring different ways to store data and
introduce you to relational databases in the next chapter.

CHAPTER 4
Caching Policies and Strategies

A Note for Early Release Readers


With Early Release ebooks, you get books in their earliest form—the authors’ raw and
unedited content as they write—so you can take advantage of these technologies long
before the official release of these titles.
This will be the 4th chapter of the final book. Please note that the GitHub repo will be
made active later on.
If you have comments about how we might improve the content and/or examples in
this book, or if you notice missing material within this chapter, please reach out to the
editor at [email protected].

In computing, a cache is a component or mechanism used to store frequently
accessed data or instructions closer to the processor, reducing the latency of retriev‐
ing the information from slower storage or external resources. Caches are typically
implemented as high-speed, small-capacity memories located closer to the CPU than
the main memory. The goal is to improve overall system performance by reducing
the time required to access data or instructions.
The concept of cache revolves around the principle of locality, which suggests that
programs tend to access a relatively small portion of their data or instructions repeat‐
edly. By storing this frequently accessed information in a cache, subsequent access to
the same data can be served faster, resulting in improved performance.
When data is accessed from a cache, there are two possible outcomes: cache hit
and cache miss. A cache hit occurs when the requested data is found in the cache,
allowing for fast retrieval without accessing the slower main memory or external
resources. On the other hand, a cache miss happens when the requested data is not
present in the cache, requiring the system to fetch the data from the main memory
or external storage. Cache hit rates measure the effectiveness of the cache in serving
requests without needing to access slower external storage, while cache miss rates
indicate how often the cache fails to serve requested data.
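To make these two outcomes concrete, here is a minimal Python sketch (the class name and the load_from_store callback are hypothetical, standing in for a real cache sitting in front of a slower store) that counts hits and misses and reports the resulting hit rate:

    class CountingCache:
        """Tiny in-memory cache that tracks hit and miss counts."""

        def __init__(self, load_from_store):
            self._data = {}               # cached key -> value
            self._load = load_from_store  # fallback to the slower data source
            self.hits = 0
            self.misses = 0

        def get(self, key):
            if key in self._data:         # cache hit: served from memory
                self.hits += 1
                return self._data[key]
            self.misses += 1              # cache miss: fetch and populate
            value = self._load(key)
            self._data[key] = value
            return value

        def hit_rate(self):
            total = self.hits + self.misses
            return self.hits / total if total else 0.0

A hit rate close to 1.0 means most requests are served from memory; a persistently low hit rate is a signal to revisit what is being cached and for how long.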
This chapter will cover important information to help you understand how to use
data caching effectively. We’ll talk about cache eviction policies, which are rules for
deciding when to remove data from the cache to make retrieval of important data
faster. We’ll also cover cache invalidation, which ensures the cached data is always
correct and matches the real underlying data source. The chapter will also discuss
caching strategies for both read and write intensive applications. We’ll also cover
how to actually put caching into action, including where to put the caches to get the
best results. You’ll also learn about different ways caching works and why Content
Delivery Networks (CDNs) are important. And finally, you’ll learn about two popular
open-source caching solutions. So, let’s get started with the benefits of caching.

Caching Benefits
Caches play a crucial role in improving system performance and reducing latency for
several reasons:
Faster Access
Caches offer faster access times compared to main memory or external storage.
By keeping frequently accessed data closer to the CPU, cache access times can be
significantly lower, reducing the time required to fetch data.
Reduced Latency
Caches help reduce latency by reducing the need to access slower storage
resources. By serving data from a cache hit, the system avoids the delay associ‐
ated with fetching data from main memory or external sources, thereby reducing
overall latency.
Bandwidth Optimization
Caches help optimize bandwidth usage by reducing the number of requests sent
to slower storage. When data is frequently accessed from the cache, it reduces the
demand on the memory bus or external interfaces, freeing up resources for other
operations.
Improved Throughput
Caches improve overall system throughput by allowing the CPU to access fre‐
quently needed data quickly, without waiting for slower storage access. This
enables the CPU to perform more computations in a given amount of time,
increasing overall system throughput.



Amdahl’s Law and the Pareto distribution provide further insights into the benefits of
caching:
Amdahl’s Law
Amdahl’s Law states that the overall speedup achieved by optimizing a particular
component of a system is limited by the fraction of time that component is
utilized. Caches, being a critical optimization component, can have a significant
impact on overall system performance, especially when the fraction of cache
hits is high. Amdahl’s Law emphasizes the importance of efficient caching to
maximize the benefits of performance optimization.
Pareto Distribution
The Pareto distribution, also known as the 80/20 rule, states that a significant
portion of the system’s workload is driven by a small fraction of the data.
Caching aligns well with this distribution by allowing the frequently accessed
data to reside in a fast cache, serving the most critical operations efficiently. By
focusing caching efforts on the most accessed data, the Pareto distribution can be
leveraged to optimize performance for the most important workloads.
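To put rough numbers behind these two ideas, the following sketch (with purely illustrative latencies) computes the expected access time for a given hit ratio and the Amdahl-style speedup of accelerating only the cacheable fraction of the workload, using speedup = 1 / ((1 - p) + p / s), where p is the accelerated fraction and s is the component speedup:

    def average_access_time(hit_ratio, cache_ns, store_ns):
        """Expected access latency given a cache hit ratio."""
        return hit_ratio * cache_ns + (1 - hit_ratio) * store_ns

    def amdahl_speedup(accelerated_fraction, component_speedup):
        """Overall speedup when only part of the workload is accelerated."""
        return 1 / ((1 - accelerated_fraction)
                    + accelerated_fraction / component_speedup)

    # Illustrative numbers: 1 ns cache vs 100 ns backing store, 90% hit ratio.
    print(average_access_time(0.9, 1, 100))   # 10.9 ns on average
    print(amdahl_speedup(0.9, 100))           # roughly a 9.2x overall speedup

Even with a very fast cache, the overall speedup is capped by the fraction of accesses the cache can actually serve, which is exactly Amdahl’s point.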
In summary, caches provide faster access to frequently accessed data, reducing
latency and improving overall system performance. They help optimize bandwidth,
increase throughput, and align with principles such as Amdahl’s Law and the Pareto
distribution to maximize performance benefits.
The next section will cover different policies to perform cache eviction, like tech‐
niques such as least recently used (LRU) and least frequently used (LFU), which can
help you choose the best caching policy for different situations.

Cache Eviction Policies


Caching plays a crucial role in improving the performance and efficiency of data
retrieval systems by storing frequently accessed data closer to the consumers. Caching
policies determine how the cache handles data eviction and replacement when its
capacity is reached. Cache eviction policies try to maximize the cache hit ratio—the
percentage of time the requested item was found in the cache and served. A higher
cache hit ratio reduces the need to retrieve data from external storage, resulting
in better system performance. In this section, we will explore various caching poli‐
cies, including Belady’s algorithm, queue-based policies (FIFO, LIFO), recency-based
policies (LRU, TLRU, MRU, SLRU), and frequency-based policies (LFU, LFRU).

Belady’s Algorithm
Belady’s algorithm is an optimal caching algorithm that evicts the data item that will
be used furthest in the future. It requires knowledge of the future access pattern,
which is usually impractical to obtain. Belady’s algorithm serves as a theoretical
benchmark for evaluating the performance of other caching policies.
Queue-Based Policies
Queue-based cache eviction policies involve managing the cache by treating it
like a queue. When the cache reaches its capacity, one of the queue-based policies
is used to remove data to make space for new data.
FIFO (First-In-First-Out)
FIFO is a simple caching policy that evicts the oldest data item from the
cache. It follows the principle that the first data item inserted into the cache
is the first one to be evicted when the cache is full. FIFO is easy to implement
but may suffer from the “aging” problem, where recently accessed items are
evicted prematurely.
LIFO (Last-In-First-Out)
LIFO is the opposite of FIFO, where the most recently inserted data item is
the first one to be evicted. LIFO does not consider the access pattern and can
result in poor cache utilization and eviction decisions.
Recency-Based Policies
Recency-based cache eviction policies focus on the time aspect of data access.
These policies prioritize keeping the most recently accessed items in the cache.
LRU (Least Recently Used)
LRU is a popular caching policy that evicts the least recently accessed data
item from the cache. It assumes that recently accessed items are more likely
to be accessed in the near future. LRU requires tracking access timestamps
for each item, making it slightly more complex to implement.
MRU (Most Recently Used)
MRU evicts the most recently accessed data item from the cache. It assumes
that the most recently accessed item is likely to be accessed again soon.
MRU can be useful in scenarios where a small subset of items is accessed
frequently.
Frequency-Based Policies
Frequency-based cache eviction policies prioritize retaining items in the cache
based on how often they are accessed. The cache replaces items that have been
accessed the least frequently, assuming that rarely accessed data may not be as
critical for performance optimization.
LFU (Least Frequently Used)
LFU evicts the least frequently accessed data item from the cache. It assumes
that items with lower access frequency are less likely to be accessed in the
future. LFU requires maintaining access frequency counts for each item,
which can be memory-intensive.
LFRU (Least Frequently Recently Used)
LFRU combines the concepts of LFU and LRU by considering both the
frequency of access and recency of access. It evicts the item with the lowest
frequency count among the least recently used items.
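To make the recency-based policies above concrete, here is a minimal LRU sketch built on Python’s OrderedDict; it illustrates the idea rather than serving as a production eviction engine:

    from collections import OrderedDict

    class LRUCache:
        """Evicts the entry that has gone untouched for the longest time."""

        def __init__(self, capacity):
            self.capacity = capacity
            self._items = OrderedDict()   # insertion order doubles as recency order

        def get(self, key):
            if key not in self._items:
                return None               # cache miss
            self._items.move_to_end(key)  # mark as most recently used
            return self._items[key]

        def put(self, key, value):
            if key in self._items:
                self._items.move_to_end(key)
            self._items[key] = value
            if len(self._items) > self.capacity:
                self._items.popitem(last=False)   # evict least recently used entry

Changing popitem(last=False) to popitem(last=True) turns the same structure into an MRU policy, since the evicted entry is then the most recently touched one.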

Allowlist Policy
An allowlist policy for cache replacement is a mechanism that defines a set of
prioritized items eligible for retention in a cache when space is limited. Instead
of using a traditional cache eviction policy that removes the least recently used or
least frequently accessed items, an allowlist policy focuses on explicitly specifying
which items should be preserved in the cache. This policy ensures that important
or high-priority data remains available in the cache, even during periods of cache
pressure. By allowing specific items to remain in the cache while evicting others,
the allowlist policy optimizes cache utilization and improves performance for critical
data access scenarios.
Caching policies serve different purposes and exhibit varying performance charac‐
teristics based on the access patterns and workload of the system. Choosing the
right caching policy depends on the specific requirements and characteristics of the
application.
By understanding and implementing these caching policies effectively, system design‐
ers and developers can optimize cache utilization, improve data retrieval perfor‐
mance, and enhance the overall user experience. Let’s discuss different cache
invalidation strategies, which are applied post identifying which data to evict based
on the above cache eviction policies.

Cache Invalidation
Cache invalidation is a crucial aspect of cache management that ensures the cached
data remains consistent with the underlying data source. Effective cache invalidation
strategies help maintain data integrity and prevent stale or outdated data from being
served. Here are four common cache invalidation techniques: active invalidation,
invalidating on modification, invalidating on read, and time-to-live (TTL).
Active Invalidation
Active invalidation involves explicitly removing or invalidating cached data when
changes occur in the underlying data source. This approach requires the appli‐
cation or the system to actively notify or trigger cache invalidation operations.
For example, when data is modified or deleted in the data source, the cache
is immediately updated or cleared to ensure that subsequent requests fetch the
latest data. Active invalidation provides precise control over cache consistency
but requires additional overhead to manage the invalidation process effectively.
Invalidating on Modification
With invalidating on modification, the cache is invalidated when data in the
underlying data source is modified. When a modification operation occurs,
such as an update or deletion, the cache is notified or flagged to invalidate
the corresponding cached data. The next access to the invalidated data triggers
a cache miss, and the data is fetched from the data source, ensuring the cache
contains the most up-to-date information. This approach minimizes the chances
of serving stale data but introduces a slight delay for cache misses during the
invalidation process.
Invalidating on Read
In invalidating on read, the cache is invalidated when the cached data is accessed
or read. Upon receiving a read request, the cache checks if the data is still valid
or has expired. If the data is expired or flagged as invalid, the cache fetches the
latest data from the data source and updates the cache before serving the request.
This approach guarantees that fresh data is always served, but it adds overhead
to each read operation since the cache must validate the data’s freshness before
responding.
Time-to-Live (TTL)
Time-to-Live is a cache invalidation technique that associates a time duration
with each cached item. When an item is stored in the cache, it is marked with
a TTL value indicating how long the item is considered valid. After the TTL
period elapses, the cache treats the item as expired, and subsequent requests for
the expired item trigger cache misses, prompting the cache to fetch the latest
data from the data source. TTL-based cache invalidation provides a simple and
automatic way to manage cache freshness, but it may result in serving slightly
stale data until the TTL expires.
The choice of cache invalidation strategy depends on factors such as the nature of
the data, the frequency of updates, the performance requirements, and the desired
consistency guarantees. Active invalidation offers precise control but requires active
management, invalidating on modification ensures immediate data freshness, invalid‐
ating on read guarantees fresh data on every read operation, and TTL-based invalida‐
tion provides a time-based expiration mechanism. Understanding the characteristics
of the data and the system’s requirements helps in selecting the appropriate cache
invalidation strategy to maintain data consistency and improve overall performance.
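Of the four techniques, TTL is the simplest to sketch in code. The illustrative cache below stores an expiry timestamp alongside each value and lazily treats anything past its deadline as a miss (a simplification of how real caches expire entries):

    import time

    class TTLCache:
        """Cache entries expire once their time-to-live has elapsed."""

        def __init__(self, default_ttl_seconds=60):
            self.default_ttl = default_ttl_seconds
            self._items = {}  # key -> (value, expires_at)

        def put(self, key, value, ttl_seconds=None):
            ttl = ttl_seconds if ttl_seconds is not None else self.default_ttl
            self._items[key] = (value, time.monotonic() + ttl)

        def get(self, key):
            entry = self._items.get(key)
            if entry is None:
                return None                       # never cached
            value, expires_at = entry
            if time.monotonic() >= expires_at:    # expired: invalidate lazily on read
                del self._items[key]
                return None
            return value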
The next section covers different caching strategies to ensure that data is properly
consistent between the cache and the underlying data source.



Caching Strategies
Caching strategies define how data is managed and synchronized between the cache
and the underlying data source. In this section, we will explore several caching
strategies as shown in Figure 4-1, including cache-aside, read-through, refresh-ahead,
write-through, write-around, and write-back.

Figure 4-1. Caching Strategies

The left-hand side of the diagram displays read-intensive caching strategies, focusing
on optimizing the retrieval of data that is frequently read or accessed. The goal
of a read-intensive caching strategy is to minimize the latency and improve the
overall performance of read operations by serving the cached data directly from
memory, which is much faster than fetching it from a slower and more distant data
source. This strategy is particularly beneficial for applications where the majority of
operations involve reading data rather than updating or writing data.
Let’s take a look at those in more detail:
Cache-Aside
Cache-aside caching strategy, also known as lazy loading, delegates the responsi‐
bility of managing the cache to the application code. When data is requested,
the application first checks the cache. If the data is found, it is returned from
the cache. If the data is not in the cache, the application retrieves it from the
data source, stores it in the cache, and then returns it to the caller. Cache-aside
caching offers flexibility as the application has full control over caching decisions
but requires additional logic to manage the cache.

Read-Through
Read-through caching strategy retrieves data from the cache if available; other‐
wise, it fetches the data from the underlying data source. When a cache miss
occurs for a read operation, the cache retrieves the data from the data source,
stores it in the cache for future use, and returns the data to the caller. Subsequent
read requests for the same data can be served directly from the cache, improving
the overall read performance. Unlike the cache-aside strategy, this strategy offloads
the responsibility of managing cache lookups from the application, providing a
simplified data retrieval process.
Refresh-Ahead
Refresh-ahead caching strategy, also known as prefetching, proactively retrieves
data from the data source into the cache before it is explicitly requested. The
cache anticipates the future need for specific data items and fetches them in
advance. By prefetching data, the cache reduces latency for subsequent read
requests and improves the overall data retrieval performance.
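As a small illustration of the read path, the cache-aside pattern can be captured in a few lines; query_database below is a hypothetical stand-in for whatever data source the application fronts, and the dictionary stands in for Redis, Memcached, or a similar store:

    cache = {}  # in practice this would be Redis, Memcached, or similar

    def query_database(user_id):
        # Hypothetical slow lookup against the system of record.
        return {"id": user_id, "name": f"user-{user_id}"}

    def get_user(user_id):
        """Cache-aside (lazy loading): the application manages the cache itself."""
        user = cache.get(user_id)
        if user is not None:
            return user                   # cache hit
        user = query_database(user_id)    # cache miss: go to the data source
        cache[user_id] = user             # populate the cache for next time
        return user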
The right-hand side of the diagram displays the write-intensive strategies, focusing
on optimizing the storage and management of data that is frequently updated
or written. Unlike read-intensive caching, where the focus is on optimizing data
retrieval, a write-intensive caching strategy aims to enhance the efficiency of data
updates and writes, while still maintaining acceptable performance levels. In a write-
intensive caching strategy, the cache is designed to handle frequent write operations,
ensuring that updated data is stored temporarily in the cache before being eventually
synchronized with the underlying data source, such as a database or a remote server.
This approach can help reduce the load on the primary data store and improve the
application’s responsiveness by acknowledging write operations more quickly.
Let’s take a look at those in more detail:
Write-Through
Write-through caching strategy involves writing data to both the cache and the
underlying data source simultaneously. When a write operation occurs, the data
is first written to the cache and then immediately propagated to the persistent
storage synchronously before the write operation is considered complete. This
strategy ensures that the data remains consistent between the cache and the data
source. However, it may introduce additional latency due to the synchronous
write operations.
Write-Around
Write-around caching strategy involves bypassing the cache for write operations.
When the application wants to update data, it writes directly to the underlying
data source, bypassing the cache. As a result, the written data does not reside
in the cache, reducing cache pollution with infrequently accessed data. However,
subsequent read operations for the updated data might experience cache misses
until the data is fetched again from the data source and cached.
Write-Back
Write-back caching strategy allows write operations to be performed directly
on the cache, deferring the update to the underlying data source until a later
time. When data is modified in the cache, the change is recorded in the cache
itself, and the update is eventually propagated to the data source asynchronously
on a schedule or when specific conditions are met (e.g., cache eviction, time
intervals). Write-back caching provides faster write operations by reducing the
number of immediate disk writes. However, it introduces a potential risk of data
loss in the event of a system failure before the changes are flushed to the data
source.
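The contrast between the write strategies is easy to see in code. In this sketch (write_to_database is a hypothetical placeholder for the system of record), write-through persists synchronously before returning, while write-back acknowledges immediately and defers persistence to a later flush:

    import queue

    cache = {}
    pending_writes = queue.Queue()   # dirty entries awaiting flush (write-back)

    def write_to_database(key, value):
        # Hypothetical synchronous write to the system of record.
        print(f"persisted {key}={value}")

    def write_through(key, value):
        cache[key] = value
        write_to_database(key, value)        # completes only after the durable write

    def write_back(key, value):
        cache[key] = value
        pending_writes.put((key, value))     # acknowledge now, persist later

    def flush_pending():
        """Flush deferred writes, e.g. on a timer or at eviction time."""
        while not pending_writes.empty():
            key, value = pending_writes.get()
            write_to_database(key, value)

The write-back variant is faster to acknowledge but, as noted above, any entries still sitting in pending_writes are lost if the process crashes before flush_pending runs.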
Each caching strategy has its own advantages and considerations, and the selection
of an appropriate strategy depends on the specific requirements and characteristics of
the system.
By understanding these caching strategies, system designers and developers can make
informed decisions to optimize data access and improve the overall performance of
their applications. Let’s cover different deployment options for a cache in the overall
system and how it affects the performance and data sharing.

Caching Deployment
When deploying a cache, various deployment options are available depending on
the specific requirements and architecture of the system. Here are three common
cache deployment approaches: in-process caching, inter-process caching, and remote
caching.
In-Process Caching
In in-process caching, the cache resides within the same process or application as
the requesting component. The cache is typically implemented as an in-memory
data store and is directly accessible by the application or service. In-process
caching provides fast data access and low latency since the cache is located within
the same process, enabling direct access to the cached data. This deployment
approach is suitable for scenarios where data sharing and caching requirements
are limited to a single application or process.
Inter-Process Caching
Inter-process caching involves deploying the cache as a separate process or ser‐
vice that runs alongside the applications or services. The cache acts as a dedicated
caching layer that can be accessed by multiple applications or processes. Appli‐
cations communicate with the cache using inter-process communication mecha‐
nisms such as shared memory, pipes, sockets, or remote procedure calls (RPC).
Inter-process caching allows multiple applications to share and access the cached
data, enabling better resource utilization and data consistency across different
components. It is well-suited for scenarios where data needs to be shared and
cached across multiple applications or processes within a single machine.
Remote Caching
Remote caching involves deploying the cache as a separate service or cluster
that runs on a different machine or location than the requesting components.
The cache service is accessed remotely over a network using protocols such as
HTTP, TCP/IP, or custom communication protocols. Remote caching enables
distributed caching and can be used to share and cache data across multiple
machines or even geographically distributed locations. It provides scalability,
fault-tolerance, and the ability to share cached data among different applications
or services running on separate machines. Remote caching is suitable for scenar‐
ios that require caching data across a distributed system or when the cache needs
to be accessed by components running on different machines.
The choice of cache deployment depends on factors such as the scale of the system,
performance requirements, data sharing needs, and architectural considerations. In-
process caching offers low latency and direct access to data within a single process,
inter-process caching enables sharing and caching data across multiple applications
or processes, and remote caching provides distributed caching capabilities across
multiple machines or locations. Understanding the specific requirements and charac‐
teristics of the system helps in selecting the appropriate cache deployment strategy
to optimize performance and resource utilization. Let’s cover different caching mech‐
anisms to improve application performance in the next section.

Caching Mechanisms
In this section, we will explore different caching mechanisms, including client-side
caching, CDN caching, web server caching, application caching, database caching,
query-level caching, and object-level caching.
Client-side Caching
Client-side caching involves storing cached data on the client device, typically in
the browser’s memory or local storage. This mechanism allows web applications
to store and retrieve static resources, such as HTML, CSS, JavaScript, and images,
directly from the client’s device. Client-side caching reduces the need to fetch
resources from the server on subsequent requests, leading to faster page load
times and improved user experience.
CDN Caching
Content Delivery Network (CDN) caching is a mechanism that involves cach‐
ing static content on distributed servers strategically located across different
geographic regions. CDNs serve cached content to users based on their proximity
to the CDN server, reducing the latency and load on the origin server. CDN
caching is commonly used to cache static files, media assets, and other frequently
accessed content, improving the overall performance and scalability of web appli‐
cations.
Web Server Caching
Web server caching refers to caching mechanisms implemented at the server-side
to store frequently accessed content. When a request is made to the server, it first
checks if the requested content is already cached. If found, the server serves the
cached content directly, avoiding the need to regenerate the content. Web server
caching is effective for static web pages, dynamic content with a long expiration
time, and content that is expensive to generate.
Application Caching
Application caching involves caching data within the application’s memory or
in-memory databases. It is typically used to store frequently accessed data or
computation results that are costly to generate or retrieve from other data
sources. Application caching improves response times by reducing the need for
repeated data retrieval and computation, enhancing the overall performance of
the application.
Database Caching
Database caching focuses on improving the performance of database operations
by caching frequently accessed data or query results. This caching mechanism
can be implemented at different levels: query-level caching and object-level cach‐
ing.
Query-Level Caching
Query-level caching involves storing the results of frequently executed quer‐
ies in memory. When the same query is executed again, the cached result
is served instead of querying the database again, reducing the database load
and improving response times.
Object-Level Caching
Object-level caching caches individual data objects or records retrieved from
the database. This mechanism is useful when accessing specific objects fre‐
quently or when the database is relatively static. Object-level caching reduces
the need for frequent database queries, improving overall application perfor‐
mance.
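As an illustration of query-level caching, the sketch below memoizes result sets keyed by the SQL text and its parameters; run_query is a hypothetical placeholder for a real database call:

    query_cache = {}

    def run_query(sql, params):
        # Hypothetical call into the database driver.
        return [("row", sql, params)]

    def cached_query(sql, params=()):
        """Serve repeated identical queries from memory instead of the database."""
        key = (sql, tuple(params))
        if key in query_cache:
            return query_cache[key]          # query-level cache hit
        result = run_query(sql, params)
        query_cache[key] = result            # cache the result set for reuse
        return result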
By employing these caching mechanisms as shown in Figure 4-2, organizations and
developers can optimize data retrieval, reduce latency, and improve the scalability
and responsiveness of their systems. However, it is essential to carefully consider
cache invalidation, cache coherence, and cache management strategies to ensure the
consistency and integrity of the cached data.

Figure 4-2. Caching mechanisms employed at different stages

Out of the above mechanisms, Content Delivery Networks (CDNs) play a crucial
role in improving the performance and availability of web content to end-users by
reducing latency and enhancing scalability by caching at edge locations. Let’s cover
CDNs in detail in the next section.

Content Delivery Networks


CDNs employ various strategies and architectural models to efficiently distribute
and cache content across geographically distributed servers. This section explores
different types of CDNs, including push and pull CDNs, optimization techniques for
CDNs, and methods for ensuring content consistency within CDNs.
CDNs can be categorized into two main types: push and pull CDNs.
Push CDN
In a push CDN, content is pre-cached and distributed to edge servers in advance.
The CDN provider proactively pushes content to edge locations based on predic‐
ted demand or predetermined rules. With push CDNs, content is only uploaded
when it is new or changed, reducing traffic while maximizing storage efficiency.
This approach ensures faster content delivery as the content is readily available at
the edge servers when requested by end-users.
Push CDNs are suitable for websites with low traffic or content that doesn’t
require frequent updates. Instead of regularly pulling content from the server, it
is uploaded to the CDNs once and remains there until changes occur.
Pull CDN
In a pull CDN, content is cached on-demand. The CDN servers pull the content
from the origin server when the first user requests it. The content is then cached
at the edge servers for subsequent requests, optimizing delivery for future users.
The duration for which content is cached is determined by a time-to-live (TTL)
setting. Pull CDNs minimize storage space on the CDN, but there can be redun‐
dant traffic if files are pulled before their expiration, resulting in unnecessary
data transfer.
Pull CDNs are well-suited for websites with high traffic since recently-requested
content remains on the CDN, evenly distributing the load.
CDNs employ different optimization techniques to improve the performance of
caching at the edge server. Let’s cover some of these optimization techniques in detail.
Dynamic Content Caching Optimization
CDNs face challenges when caching dynamic content that frequently changes
based on user interactions or real-time data. To optimize dynamic content cach‐
ing, CDNs employ various techniques such as:
Content Fragmentation
Breaking down dynamic content into smaller fragments to enable partial
caching and efficient updates.
Edge-Side Includes (ESI)
Implementing ESI tags to separate dynamic and static content, allowing
dynamic portions to be processed on-the-fly while caching the static frag‐
ments.
Content Personalization
Leveraging user profiling and segmentation techniques to cache personalized
or user-specific content at the edge servers.
Multi-tier CDN architecture
Multi-tier CDN architecture involves the distribution of content across multiple
layers or tiers of edge servers. This approach allows for better scalability, fault
tolerance, and improved content delivery to geographically diverse regions. It
enables efficient content replication and reduces latency by bringing content
closer to end-users.
DNS Redirection
DNS redirection is a mechanism employed by CDNs to direct user requests
to the nearest or most suitable edge server. By resolving DNS queries to the
most appropriate server based on factors like geographic proximity, network
conditions, and server availability, CDNs optimize content delivery and reduce
latency.
Client Multiplexing
Client multiplexing refers to the technique of combining multiple HTTP requests
and responses into a single connection between the client and the edge server.
This reduces the overhead of establishing multiple connections and improves
efficiency, especially for small object requests, resulting in faster content delivery.
Content Consistency in CDNs
Ensuring content consistency across multiple edge servers within a CDN is crucial to
deliver the most up-to-date and accurate content. CDNs employ various methods to
maintain content consistency, including:
Periodic Polling
CDNs periodically poll the origin server to check for updates or changes in
content. This ensures that cached content is refreshed to reflect the latest version.
Time-to-Live (TTL)
CDNs utilize Time-to-Live values, specified in HTTP headers or DNS records,
to determine how long cached content remains valid. Once the TTL expires, the
CDN fetches updated content from the origin server.
Leases
CDNs use lease-based mechanisms to control the duration of content caching at
the edge servers. Leases define a specific time window during which the content
remains valid before requiring renewal or revalidation.

AWS offers Amazon CloudFront, a pull CDN offering built for high
performance, security, and developer convenience, which we will
cover in more detail in Chapter 9 - AWS Network Services.

Before ending the section, let’s also understand that using a CDN can come with
certain drawbacks due to cost, stale content, and frequent URL changes. CDNs
may involve significant costs depending on the amount of traffic. However, it’s impor‐
tant to consider these costs in comparison to the expenses you would incur without
utilizing a CDN. If updates are made before the TTL expires, there is a possibility
of content being outdated until it is refreshed on the CDN. CDNs require modifying
URLs for static content to point to the CDN, which can be an additional task to
manage.
Overall, CDNs offer benefits in terms of performance and scalability but require
careful consideration of these factors and the specific needs of your website. At the
end of this chapter, let’s dive deeper into two popular open-source caching solutions
to understand their architecture and how they implement the caching concepts
discussed in the chapter.



Open Source Caching Solutions
Open source caching solutions, such as Redis and Memcached, have gained popular‐
ity due to their efficiency, scalability, and ease of use. Let’s take a closer look at
Memcached and Redis, two widely adopted open-source caching solutions.

Memcached
Memcached is an open-source, high-performance caching solution widely used in
web applications. It operates as a distributed memory object caching system, storing
data in memory across multiple servers. Here are some key features and benefits of
Memcached:
Simple and Lightweight
Memcached is designed to be simple, lightweight, and easy to deploy. It focuses
solely on caching and provides a straightforward key-value interface for data
storage and retrieval.
Horizontal Scalability
Memcached follows a distributed architecture, allowing it to scale horizontally by
adding more servers to the cache cluster. This distributed approach ensures high
availability, fault tolerance, and improved performance for growing workloads.
Protocol Compatibility
Memcached adheres to a simple protocol that is compatible with various pro‐
gramming languages. This compatibility makes it easy to integrate Memcached
into applications developed in different languages.
Transparent Caching Layer
Memcached operates as a transparent caching layer, sitting between the applica‐
tion and the data source. It helps alleviate database or API load by caching
frequently accessed data, reducing the need for repetitive queries.
Let’s take a look at Memcached’s architecture.

Memcached Architecture
Memcached’s architecture consists of a centralized server that coordinates the storage
and retrieval of cached data. When a client sends a request to store or retrieve data,
the server handles the request and interacts with the underlying memory allocation
strategy.
Memcached follows a multi-threaded architecture that enables it to efficiently han‐
dle concurrent requests and scale across multiple CPU cores. In this architecture,
Memcached utilizes a pool of worker threads that can simultaneously process cli‐
ent requests. Each worker thread is responsible for handling a subset of incoming
requests, allowing for parallel execution and improved throughput. This multi-
threaded approach ensures that Memcached can effectively handle high traffic loads
and distribute the processing workload across available CPU resources. By leveraging
multiple threads, Memcached can achieve better performance and responsiveness,
making it suitable for demanding caching scenarios where high concurrency is a
requirement.
In terms of memory allocation, Memcached employs a slab allocation strategy. It
divides the allocated memory into fixed-size chunks called slabs. Each slab is further
divided into smaller units known as pages. These pages are then allocated to store
individual cache items. The slab allocation strategy allows Memcached to efficiently
manage memory by grouping items of similar sizes together. It reduces memory
fragmentation and improves memory utilization.
When a new item is added to the cache, Memcached determines the appropriate
slab size for the item based on its size. If an existing slab with enough free space is
available, the item is stored in that slab. Otherwise, Memcached allocates a new slab
from the available memory pool and adds the item to that slab. The slab allocation
strategy enables efficient memory utilization and allows Memcached to store a large
number of items in memory while maintaining optimal performance.
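To make the slab idea concrete, the following sketch builds a set of chunk sizes that grow geometrically (the minimum size and 1.25 growth factor are illustrative; in Memcached they are configurable) and picks the smallest class that fits an item:

    def build_slab_classes(min_size=96, max_size=1024 * 1024, growth_factor=1.25):
        """Chunk sizes grow geometrically, mirroring the slab-class idea."""
        sizes, size = [], min_size
        while size < max_size:
            sizes.append(int(size))
            size *= growth_factor
        sizes.append(max_size)
        return sizes

    def pick_slab_class(item_size, slab_classes):
        """Return the smallest chunk size that can hold the item."""
        for chunk in slab_classes:
            if item_size <= chunk:
                return chunk
        raise ValueError("item larger than the largest slab class")

    classes = build_slab_classes()
    print(pick_slab_class(200, classes))   # the item lands in the next-larger chunk

Because every item in a slab occupies a fixed-size chunk, some space inside each chunk goes unused, which is the price paid for avoiding general-purpose heap fragmentation.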
Overall, Memcached’s architecture and memory allocation strategy work together to
provide a lightweight and efficient caching solution that can handle high traffic loads
and deliver fast data access times. By leveraging memory effectively and employing
a scalable architecture, Memcached enables applications to significantly improve
performance by caching frequently accessed data in memory.

Redis
Redis, short for Remote Dictionary Server, is a server-based in-memory data struc‐
ture store that can serve as a high-performance cache. Unlike traditional databases
that rely on iterating, sorting, and ordering rows, Redis organizes data in customiz‐
able data structures from the ground up, supporting a wide range of data types,
including strings, bitmaps, bitfields, lists, sets, hashes, geospatial indexes, hyperloglogs, and more,
making it versatile for various caching use cases. Here are some key features and
benefits of Redis:
High Performance
Redis is designed for speed, leveraging an in-memory storage model that allows
for extremely fast data retrieval and updates. It can handle a massive number of
operations per second, making it suitable for high-demand applications.
Persistence Options
Redis provides persistence options that allow data to be stored on disk, ensuring
durability even in the event of system restarts. This feature makes Redis suitable
for use cases where data needs to be retained beyond system restarts or cache
invalidations.
Advanced Caching Features
Redis offers advanced caching features, such as expiration times, eviction poli‐
cies, and automatic cache invalidation based on time-to-live (TTL) values. It also
supports data partitioning and replication for scalability and fault tolerance.
Pub/Sub and Messaging
Redis includes publish/subscribe (pub/sub) messaging capabilities, enabling real-
time messaging and event-driven architectures. This makes it suitable for scenar‐
ios involving real-time data updates and notifications.
Redis serves as an in-memory database primarily used as a cache in front of other
databases like MySQL or PostgreSQL. By leveraging the speed of memory, Redis
enhances application performance and reduces the load on the main database. It is
particularly useful for storing data that changes infrequently but is frequently reques‐
ted, as well as data that is less critical but undergoes frequent updates. Examples
of such data include session or data caches, leaderboard information, and roll-up
analytics for dashboards.
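A typical cache-in-front-of-a-database flow looks like the sketch below, which assumes a locally running Redis server and the redis-py client library; the key naming, TTL, and the fetch_profile_from_db placeholder are illustrative:

    import json
    import redis

    r = redis.Redis(host="localhost", port=6379, db=0)

    def fetch_profile_from_db(user_id):
        # Hypothetical stand-in for a MySQL/PostgreSQL query.
        return {"id": user_id, "plan": "pro"}

    def get_profile(user_id, ttl_seconds=300):
        key = f"profile:{user_id}"
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)                   # served from Redis
        profile = fetch_profile_from_db(user_id)
        r.setex(key, ttl_seconds, json.dumps(profile))  # cache with a TTL
        return profile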
Redis architecture is designed for high performance, low latency, and simplicity. It
provides a range of deployment options for ensuring high availability based on the
requirements and cost constraints. Let’s go over the availability in Redis deployments
in detail, followed by persistence models for redis durability and memory manage‐
ment in Redis.

Availability in Redis Deployments


Redis supports different deployment architectures as shown in Figure 4-3, including
a single Redis instance, Redis HA (High Availability), Redis Sentinel, and Redis
Cluster. Each architecture has its trade-offs and is suitable for different use cases and
scalability needs.

Figure 4-3. Redis Deployment Setups



Single Redis Instance
In a single Redis instance setup, Redis is deployed as a standalone server. While
it is straightforward and suitable for small instances, it lacks fault tolerance. If the
instance fails or becomes unavailable, all client calls to Redis will fail, affecting
overall system performance.
Redis HA (High Availability)
Redis HA involves deploying a main Redis instance with one or more secondary
instances that synchronize with replication. The secondary instances can help
scale read operations or provide failover in case the main instance is lost. Repli‐
cation ID and offset play a crucial role in the synchronization process, allowing
secondary instances to catch up with the main instance’s data.
Redis Sentinel
Redis Sentinel is a distributed system that ensures high availability for Redis.
Sentinel processes coordinate the state and monitor the availability of main and
secondary instances. They also serve as a point of discovery for clients, informing
them of the current main instance. Sentinel processes can start a failover process
if the primary instance becomes unavailable.
Redis Cluster
Redis Cluster enables horizontal scaling by distributing data across multiple
machines or shards. Algorithmic sharding is used to determine which Redis
instance (shard) holds a specific key. Redis Cluster employs a hash slot mech‐
anism to map data to shards and allows for seamless resharding when adding
new instances to the cluster. Gossip Protocol is used in Redis Cluster to maintain
cluster health. Nodes constantly communicate to determine the availability of
shards and can promote secondary instances to primary if needed.
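Algorithmic sharding boils down to hashing a key into one of 16,384 slots and looking up which shard owns that slot. The sketch below illustrates the idea; note that it uses CRC32 from the Python standard library purely for illustration, whereas Redis itself uses a CRC16 variant, and the slot-range assignments are made up:

    import zlib

    TOTAL_SLOTS = 16384   # Redis Cluster's fixed number of hash slots

    def hash_slot(key: str) -> int:
        """Map a key to a hash slot (CRC32 here; Redis uses CRC16)."""
        return zlib.crc32(key.encode()) % TOTAL_SLOTS

    def owning_shard(key: str, slot_ranges):
        """slot_ranges: list of (first_slot, last_slot, shard_name) assignments."""
        slot = hash_slot(key)
        for first, last, shard in slot_ranges:
            if first <= slot <= last:
                return shard
        raise LookupError("no shard owns this slot")

    ranges = [(0, 5460, "shard-a"), (5461, 10922, "shard-b"), (10923, 16383, "shard-c")]
    print(owning_shard("user:42", ranges))

Resharding then amounts to reassigning ranges of slots to new shards and migrating only the keys in those slots, rather than rehashing the entire keyspace.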

Durability in Redis Deployment


Redis provides two persistence models for data durability: RDB files (Redis Database
Files) and AOF (Append-Only File). These persistence mechanisms ensure that data
is not lost in case of system restarts or crashes. Let’s explore both models in more
detail:

RDB Files (Redis Database Files). RDB is the default persistence model in Redis. It peri‐
odically creates snapshots of the dataset and saves them as binary RDB files. These
files capture the state of the Redis database at a specific point in time. Here are key
features and considerations of RDB persistence:
Snapshot-based Persistence
RDB persistence works by periodically taking snapshots of the entire dataset
and storing it in a file. The frequency of snapshots can be configured based on
requirements.



Efficiency and Speed
RDB files are highly efficient in terms of disk space usage and data loading speed.
They are compact and can be loaded back into Redis quickly, making it suitable
for scenarios where fast recovery is essential.
Full Data Recovery
RDB files provide full data recovery as they contain the entire dataset. In case of
system failures, Redis can restore the data by loading the most recent RDB file
available.
However, it’s worth noting that RDB files have some limitations. Since they are
snapshots, they do not provide real-time durability and may result in data loss if a
crash occurs between two snapshot points. Additionally, restoring large RDB files can
take time and impact the system’s performance during the recovery process.

AOF (Append-Only File). AOF persistence is an alternative persistence model in Redis
that logs every write operation to an append-only file. AOF captures a sequential log
of write operations, enabling Redis to reconstruct the dataset by replaying the log.
Here are key features and considerations of AOF persistence:
Write-ahead Log
AOF persists every write operation to the append-only file as a series of com‐
mands or raw data. This log can be used to rebuild the dataset from scratch.
Durability and Flexibility
AOF offers more durability than RDB files since it captures every write opera‐
tion. It provides the ability to recover data up to the last executed command.
Moreover, AOF offers different persistence options (such as every write, every
second, or both) to balance durability and performance.
Append-only Nature
AOF appends new write operations to the end of the file, ensuring that the
original dataset is never modified. This approach protects against data corruption
caused by crashes or power failures.
However, AOF persistence comes with its own considerations. The append-only file
can grow larger over time, potentially occupying significant disk space. Redis offers
options for AOF file rewriting to compact the log and reduce its size. Additionally,
AOF persistence typically has a slightly higher performance overhead compared to
RDB files due to the need to write every command to disk.
In practice, Redis users often employ a combination of RDB and AOF persistence
based on their specific requirements and trade-offs between performance, durability,
and recovery time objectives.



It’s important to note that Redis also provides an option to use no persistence (volatile
mode) if durability is not a primary concern or if data can be regenerated from an
external source in the event of a restart or crash.

Memory Management in Redis


Redis leverages forking and copy-on-write (COW) techniques to facilitate data per‐
sistence efficiently within its single-threaded architecture. When Redis performs a
snapshot (RDB) or background saving operation, it follows these steps:

1. Forking: Redis uses the fork() system call to create a child process, which is
an identical copy of the parent process. Forking is a lightweight operation as it
creates a copy-on-write clone of the parent’s memory.
2. Copy-on-Write (COW): Initially, the child process shares the same memory
pages with the parent process. However, when either the parent or child process
modifies a memory page, COW comes into play. Instead of immediately dupli‐
cating the modified page, the operating system creates a new copy only when
necessary.

By employing COW, Redis achieves the following benefits:


Memory Efficiency
When the child process is initially created, it shares the same memory pages with
the parent process. This shared memory approach consumes minimal additional
memory. Only the modified pages are copied when necessary, saving memory
resources.
Performance
Since only the modified pages are duplicated, Redis can take advantage of the
COW mechanism to perform persistence operations without incurring a signif‐
icant performance overhead. This is particularly beneficial for large datasets
where copying the entire dataset for persistence would be time-consuming.
Fork Safety
Redis uses fork-based persistence to avoid blocking the main event loop during
the snapshot process. By forking a child process, the parent process can continue
serving client requests while the child process performs the persistence operation
independently. This ensures high responsiveness and uninterrupted service.
It’s important to note that while forking and COW provide memory efficiency and
performance benefits, they also have considerations. Forking can result in increased
memory usage during the copy-on-write process if many modified pages need to
be duplicated. Additionally, the fork operation may be slower on systems with large
memory footprints.



Overall, Redis effectively utilizes forking and copy-on-write mechanisms within its
single-threaded architecture to achieve efficient data persistence. By employing these
techniques, Redis can perform snapshots and background saving operations without
significantly impacting its performance or memory usage.
Overall, Redis offers developers a powerful and flexible data storage solution with
various deployment options and capabilities.
Both Redis and Memcached are excellent open-source caching solutions with their
unique strengths. The choice between them depends on specific requirements and
use cases. Redis is suitable for scenarios requiring versatile data structures, persis‐
tence, pub/sub messaging, and advanced caching features. On the other hand, Memc‐
ached shines in simple, lightweight caching use cases that prioritize scalability and
ease of integration.

AWS offers Amazon ElastiCache, compatible with both Redis and
Memcached, for real-time use cases like caching, session stores,
gaming, geo-spatial services, real-time analytics, and queuing,
which we will cover in more detail in Chapter 10 - AWS Storage
Services.

Conclusion
In concluding this chapter on caching, we have journeyed through a comprehensive
exploration of the fundamental concepts and strategies that empower efficient data
caching. We’ve covered cache eviction policies, cache invalidation mechanisms, and
a plethora of caching strategies, equipping you with the knowledge to optimize data
access and storage. We’ve delved into caching deployment, understanding how strate‐
gic placement can maximize impact, and explored the diverse caching mechanisms
available. Additionally, we’ve touched upon Content Delivery Networks (CDNs) and
open-source caching solutions including Redis and Memcached, that offer robust
options for enhancing performance. By incorporating Redis or Memcached into your
architecture, you can significantly improve application performance, reduce response
times, and enhance the overall user experience by leveraging the power of in-memory
caching.
As we move forward in our exploration of enhancing system performance, the next
chapter will embark on an exploration of scaling and load balancing strategies. Scal‐
ing is a pivotal aspect of modern computing, allowing systems to handle increased
loads gracefully. We will also delve into strategies for load balancing in distributing
incoming traffic efficiently. Together, these topics will empower you to design and
maintain high-performing systems that can handle the demands of today’s dynamic
digital landscape.

About the Authors
Jayanth Kumar is a Software Development Manager at Amazon, where he is
currently building large-scale software systems. Kumar is a millennial polymath,
published poet, certified AWS Solutions Architect Professional, entrepreneur, engi‐
neering leader and an assistant professor. He earned his bachelor’s degree from IIT
Bombay and his master’s degree from UCLA, where he studied Multi-Objective
Deep Learning. He formerly held software engineering positions at SAP Germany
and Silicon Valley. Later, as an entrepreneur, he held the positions of Head of Engi‐
neering at Goodhealth and Engineering Manager at Delhivery, an Indian unicorn
company. He is always seeking a challenge and new opportunities for learning, and
focuses on building robust mechanisms and systems that will stand the test of time.
Mandeep Singh is a Software Engineer at Jupiter. He fell in love with Cloud, AWS
Technologies and Distributed Systems in the role of Big Data Cloud Support Asso‐
ciate at AWS and as a Software Engineer at Amazon. He supports the learning and
development of others by sharing lessons on Cloud and Distributed Systems on his
YouTube channel. He enjoys morning runs, weight lifting, cooking and spending time
with his family in the serene ambience of his hometown.
