What Is SRE?
An Introduction to
Site Reliability Engineering
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. What Is SRE?, the
cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
The views expressed in this work are those of the authors, and do not represent the
publisher’s views. While the publisher and the authors have used good faith efforts
to ensure that the information and instructions contained in this work are accurate,
the publisher and the authors disclaim all responsibility for errors or omissions,
including without limitation responsibility for damages resulting from the use of or
reliance on this work. Use of the information and instructions contained in this
work is at your own risk. If any code samples or other technology this work contains
or describes is subject to open source licenses or the intellectual property rights of
others, it is your responsibility to ensure that your use thereof complies with such
licenses and/or rights.
This work is part of a collaboration between O’Reilly and Verizon Digital Media. See
our statement of editorial independence.
978-1-492-05441-2
Table of Contents
1. Defining “SRE”
Digging into the Terms in These Definitions
Where Did SRE Come From?
What’s the Relationship Between SRE and DevOps?
How Do I Get My Company to “Do SRE”?
3. Implementing SRE
Hierarchy of Reliability
Starting a New Organization with SRE
Introducing SRE into an Existing Organization
Overlap Between Greenfield and Brownfield
A. Further Reading
CHAPTER 1
Defining “SRE”
1 Hat tip to Laura Nolan for this wording. Also note that the skills and capabilities to
troubleshoot production problems and feed that learning back into making things better
can and do exist in teams where reliability may be a shared mandate. The relative
balance of concerns between reliability and “other things” will affect the effectiveness of
the execution.
The use of service level indicators (SLIs) and service level objectives
(SLOs) as meaningful indicia of service health is one of the distinguishing
characteristics of SRE practice. It is important to recognize
that SLOs are symptoms of a healthy relationship between the reliability
(SRE) team and the feature team, not a compliance exercise
dictated by management. In the pursuit of greater reliability, SREs
will focus on bringing as many components of the greater system
space as possible into a resilient, predictable, consistent, repeatable,
and measured state. Major areas of expertise can include:
• Release engineering
• Change management
• Monitoring and observability
• Managing and learning from incidents
• Self-service automation
• Troubleshooting
• Performance
• The use of deliberate adversity (chaos engineering)
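To make the SLI and SLO terms above concrete, here is a minimal sketch in Python. It assumes an availability-style SLI (the fraction of "good" events out of all events) and a hypothetical SLO target of 99.9%. The names `SloReport` and `evaluate_slo`, the request counts, and the "budget remaining" figure (the share of allowed bad events not yet used up, a common companion idea that this section does not spell out) are all illustrative assumptions, not something prescribed by the report.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    """Result of comparing a measured SLI against an agreed SLO (illustrative)."""
    sli: float               # measured fraction of good events
    slo_target: float        # agreed objective, e.g. 0.999 (hypothetical value)
    budget_remaining: float  # share of the allowed bad events not yet used

    @property
    def meets_slo(self) -> bool:
        return self.sli >= self.slo_target


def evaluate_slo(good_events: int, total_events: int, slo_target: float) -> SloReport:
    """Compute an availability-style SLI and compare it with the SLO target."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    sli = good_events / total_events
    allowed_bad = (1.0 - slo_target) * total_events  # bad events the SLO tolerates
    actual_bad = total_events - good_events
    budget_remaining = 1.0 - actual_bad / allowed_bad  # negative means overspent
    return SloReport(sli, slo_target, budget_remaining)


# Hypothetical numbers: one million requests, 500 of which failed.
report = evaluate_slo(good_events=999_500, total_events=1_000_000, slo_target=0.999)
print(f"SLI={report.sli:.4%} meets_slo={report.meets_slo} "
      f"budget_remaining={report.budget_remaining:.0%}")
```

A number like `budget_remaining` is one way to ground the conversation between the SRE team and the feature team in data: both sides can see how much room is left before the objective is missed.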
2 Hat tip to David Blank-Edelman and the Azure SRE leadership team for this wording.
Data-Informed
It is critical that these feedback loops be automated in order to scale.
Scale is further enabled by relying on data rather than opinion.
Measurements are inevitably artifacts of their time and environment,
constrained by the technologies that are used to obtain them.
Changes in the environment or better understandings of the dynamics
of a system can lead to valid technical arguments about whether
a measurement is accurate or effective in a particular context.
Continually improving the measurements to adequately inform product
decisions is one of the benefits of having a standing SRE team. As
noted by Lord Kelvin:
When you can measure what you are speaking about, and express it
in numbers, you know something about it; but when you cannot
measure it, when you cannot express it in numbers, your knowledge
is of a meagre and unsatisfactory kind; it may be the beginning
of knowledge, but you have scarcely, in your thoughts,
advanced to the stage of science, whatever the matter may be.
3 Interestingly, at around the same time the Greek philosopher Zeno posed his “Achilles
and the tortoise” paradox, which is an alternate formulation of the same puzzle.
Continuous Improvement
Especially in the consumer-facing online internet service world,
nothing stands still for long. Services add new features daily, if not
Organizational Model
Effective and successful teams don’t happen by accident. The discussions
and agreements around SLOs take time and conscious effort to
negotiate and track. Keeping teams from being consumed by the
ever-increasing demands of users, developers, and services so that
they can do the necessary design and development to engineer solutions
also requires a nontrivial organizational commitment to reliability
and SRE.
Companies in which SRE teams are successful are those that have
made reliability a priority. They staff their SRE team(s) appropriately
so that on-call responsibilities are sustainable and long-term
engineering output is possible. They protect the balance between
engineering project work and operational load by pushing back on
the forces of entropy (interruption) that would erode the team’s
ability to deliver productive output.
Culture/Capabilities/Configuration
In Innovation Prowess (Wharton Digital Press), George S. Day, a
business professor at Wharton, identified a framework for the
underlying components in highly innovative companies. He classified
them into the “three C’s”:
[The] elements are mutually reinforcing. They don’t simply add
together; instead they are multiplicative, as a weakness in one
afflicts the others….
Culture underlies and infuses everything an organization
does….Culture and capabilities have a symbiotic relationship—one
can’t function without the other. They also have to be closely
aligned to get superior results.
Culture
As covered earlier, clear and unambiguous support for site reliability
is an absolute necessity for a successful SRE implementation. High
(enough) executive-level support can be shown through organizational
structures, but a culture that recognizes and rewards work that
enhances reliability provides the environment in which SRE can be
viable. When things are going badly, the organization must demonstrate
its commitment to reliability by reallocating engineering
resources across the board to address the deficiencies. Other important
cultural components include:
Capabilities
For SREs, the ideal team member has a broad understanding of
computer system dynamics, especially distributed systems. Effective
SREs are able to zoom out to deal with system interrelationships
and to zoom in, as needed, to debug the bit-level intricacies of
networking or memory usage patterns.
1 See “Carol Dweck: A Summary of the Two Mindsets and the Power of Believing That
You Can Improve” and her book Mindset: The New Psychology of Success (Ballantine
Books). Also see Peter Senge’s The Fifth Discipline: The Art and Practice of The Learning
Organization (Doubleday).
2 See Chapter 11 of Accelerate, by Nicole Forsgren, Gene Kim, and Jez Humble
(IT Revolution Press).
3 See Project Aristotle and Chapter 27 of Seeking SRE by David Blank-Edelman
(O’Reilly).
Configuration
Finally, in the configuration space, a strong SRE practice requires
reporting structures that allow SREs to be evaluated and rewarded
according to the distinct measures of performance that matter the
most to them (not just how quickly features get shipped). Note that
these reporting structures may be local distinctions, matrix-based,
or a fully independent organization within engineering and still provide
effective evaluation and recognition incentives. SRE success can
be tracked across five areas that contribute to reliability:
• Improving velocity through reduction of toil and manual exception
handling⁴
• Effectively handling and learning from incidents
4 The antipattern version of this is referred to as “feeding the machine with the effort and
toil of humans.” Working in that way is not only inhumane but does not effectively
scale, because you simply can’t hire enough people to keep up with the demand—and it
would be prohibitively expensive.
5 These characterizations are necessarily simplified in the interest of being succinct.
There are many variations and a range of overlapping practice for all of the described
roles which can make it difficult to distinguish one from another.
When it comes time to implement a new SRE team, the main factor
that shapes the plan is whether you are starting “fresh” (a
“greenfield” project) or taking a “brownfield” approach and migrating
an existing team. In either scenario, the amount of cultural
change needed can be daunting.
Even before a team is formed, one must prioritize the work to be
done. A guide for figuring out where to start is the Hierarchy of Reliability.
Since this hierarchy is going to guide the future changes, let’s
start by explaining what it is.
Hierarchy of Reliability
In late 2013, an SRE from Google, Mikey Dickerson, was asked to
help the struggling HealthCare.gov. (To demonstrate the previous
terms, he stepped into a situation that had a number of pieces in
place, but the site as a whole was not functioning as desired. This is
a great example of a “brownfield” scenario.) There was some intense
time pressure to get things working quickly. He needed a way to
explain “reliability” in a simple and straightforward way, so he borrowed
from a theory in psychology, Maslow’s Hierarchy of Needs.
The Hierarchy of Reliability that Dickerson used to help
HealthCare.gov is shown in Figure 3-1.
Figure 3-1. Hierarchy of Reliability
The idea is that topics at the bottom are more “basic,” and they gradually
get more advanced as you progress up the pyramid. But each
topic (or “level,” as we will refer to them) is not exclusively dependent
on the levels below it. Rather, they build on one another. When
each level is done well, then the other levels naturally benefit.¹
As an extreme example, let’s look at the very bottom (“monitoring”)
and the very top (“product”). Obviously, your company could have a
product without monitoring. But nobody would know if, say, half of
your customers only saw error pages, or if they saw the product
(site) that you had designed.
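The "half of your customers only saw error pages" scenario can be sketched as a very small monitoring check. The following Python example is a sketch under stated assumptions: it treats HTTP status codes of 500 and above as error pages, uses a hypothetical alert threshold of 5%, and works on a hand-written sample list. The function names `error_fraction` and `should_alert` are illustrative and do not come from any particular monitoring tool.

```python
from typing import Iterable, List


def error_fraction(status_codes: Iterable[int]) -> float:
    """Return the fraction of responses that were server errors (status >= 500)."""
    codes: List[int] = list(status_codes)
    if not codes:
        return 0.0  # no traffic observed, so nothing to report
    errors = sum(1 for code in codes if code >= 500)
    return errors / len(codes)


def should_alert(status_codes: Iterable[int], threshold: float = 0.05) -> bool:
    """Flag the service when the error fraction exceeds a (hypothetical) threshold."""
    return error_fraction(status_codes) > threshold


# Hypothetical sample: half of the responses are error pages.
sample = [200, 200, 500, 503, 200, 502, 200, 500]
print(f"error fraction: {error_fraction(sample):.0%}, alert: {should_alert(sample)}")
```

Without even this basic level of monitoring in place, the product at the top of the pyramid could be failing for half of its users and nobody on the team would know.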
While the levels used in Figure 3-1 are a proven set that a team can
use to prioritize work, there are two things that we want to add.
In Chapter 1, we mentioned how being “data-informed” was critical
to having valuable feedback loops. One of the key ways to gather
data is through the various metrics that your software produces.
While it is implied that (good) metrics are necessary for monitoring,
we do want to call this out explicitly to reinforce its importance.
Figure 3-2 adds that “metrics” level.
1 For a more complete discussion of the pyramid, see Part III of the book Site Reliability
Engineering.
The other level to add has to do with the people that make up the
team. Being on call can be very stressful. Even if there are no issues
or outages, the people on call have to be available, which can impact
the quality time they spend with their families and friends. And
when an outage occurs, there is often immense pressure to get
things working as quickly as possible. This can lead to long hours
that drift into the early morning. Be aware of the impact this has on
the people that work in the team. We add that “softer” level of “life”
(food, sleep, family, etc.) in Figure 3-3.
The resulting pyramid presents a solid guide for the work that needs
to get done to make a site (or system) reliable.
At around the same time, another team (now known as the foundation
team) was chartered to develop consistent tooling and processes
for the engineering org to use as they developed the site’s codebase.
The foundation team focuses on engineering productivity, building
and supporting the development environment tooling from base
libraries, IDEs, and version control through the CI/CD pipelines
into production.
As the engineering organization has grown, the SRE and foundation
teams have also grown, with each now accounting for about 10% of
the total engineering headcount. As the scale of the problems has
increased, so have the services and systems that are developed and
maintained by the SRE organization in order to keep up with the
demands of the site.
For a profession that has only been a named role for about 15 years,
SRE has grown into a significant force. Two SREs have even ended
up on the cover of Time magazine.¹ A look across a number of job
posting sites shows thousands of open positions around the
world. Table 4-1 gives an idea of the numbers on some popular sites
as of January 2019.
job openings as well as median salary across the span of those three
reports (base salary increased from $140k in 2017 to $200k in 2019).
Another perspective on the growth of the profession can be seen in
Figure 4-1, which shows the attendance numbers for the USENIX
SREcon conference series that began in 2014.
Lex Neva’s newsletter SRE Weekly, which covers blog posts and other
online articles of interest to the profession, has seen similar growth
from its beginning in 2016 (Figure 4-2).
• Thinking that SRE is a point solution to a particular problem
rather than a fundamental cultural shift
This IS SRE
Hearkening back to the beginning:
SRE is an organizational model for running reliable online services
by teams that are chartered to do reliability-focused engineering
work.
As a discipline, SREs are devoted to helping an organization sustainably
achieve the appropriate level of reliability for its services by
implementing and continually improving data-informed production
feedback loops to balance availability, performance, and agility.
Does it make sense for your company to commit heavily to reliability
and pursue the implementation of SRE in your organization?
Only you and the other leaders in your company can answer that
question. Some companies will be at a size where having a distinct
organizational component or team just does not fit, but the principles
can be put in place to provide a foundation for the future.
As with any new methodology or cultural shift, implementing SRE
will take time, grit, and humility to adjust to the
changing circumstances, but the payoff will be an institutionalized
commitment to the importance of the user’s interaction with your
site, service, system, or other “online stuff.” Over time, with the SRE
team(s) consistently representing reliability and operability concerns
as well as actively contributing to the product codebase to
improve reliability, feature developers will learn to factor these
pieces into their plans as they develop new features. At that point,
SREs will be able to shift their impact to a deeper and wider level,
making next month’s problems different from today’s.
Our hope is that this brief introduction to Site Reliability Engineering
will have provided you with an effective understanding of the
what and how of SRE. There are lots of resources available to dive
into greater detail. We’ve listed some of the best starting points for
further reading in Appendix A.
About the Authors
Kurt Andersen is a part of the Product-SRE team at LinkedIn. He
has been one of the co-chairs for SREcon Americas and has been active
in the anti-abuse community for over 15 years. He also works as one
of the program committee chairs for the Messaging, Malware and
Mobile Anti-Abuse Working Group (M3AAWG.org). Kurt has spoken
around the world on various aspects of reliability, authentication,
anti-abuse, and security. He also works on internet standards
through the IETF and serves on the USENIX Board of Directors and
as liaison to the SREcon conferences worldwide.
Craig Sebenik is currently an SRE at Split Software. He has worked
at several startups over the years, and a few large, well-known companies
(including LinkedIn and NetApp). He is the author of Salt
Essentials (O’Reilly) and has spoken at LISA, SREcon, and SaltConf.
Craig also has a passion for cooking. He earned Le Grand Diplôme
from Le Cordon Bleu, a master’s degree in Italian cuisine from Apicius
(Florence, Italy), and a master’s degree in gastronomy from the
University of Reims (France).