Table of Contents
HIGHLIGHTS AND INTRODUCTION
Welcome Letter
Lindsay Smith, Senior Publications Manager at DZone
DZONE RESEARCH
ADDITIONAL RESOURCES
A state-of-the-art, high-powered jet couldn't make it 10 feet off the ground without an adequate amount of fuel, assessment of landing gear, and everything else found on the FAA's preflight checklist. If one small piece of the larger aircraft engine is missing or broken, the entire plane cannot fly. Or worse, there are serious consequences should the

The same goes for software — the most sophisticated, cutting-edge application cannot run without the proper maintenance and monitoring processes in place. Application performance and site reliability have become the wings to testing, deployment, and maintenance of our software.

Because the performance and monitoring landscape continues to evolve, so must our approach and technology. In this report, we wanted to assess the topic of application performance but take it one level deeper; we wanted to further explore the world of site reliability and observability

causes and specific problems in our code like CPU thrashing

While this all sounds great, actual implementation proves more complicated, so we studied what this looks like across organizations by surveying the DZone audience.

Our findings were quite remarkable from last year's survey of our performance experts — a lot can change in a year.

reliability, and observability for distributed systems in our contributor insights, covering everything from patterns for building distributed systems to developing your own SRE program. As always, DZone partners with an exceptional group of expert contributors who really put these topics and findings into perspective.

So, as you prepare your application for takeoff, make your preflight checks, and ensure cabins are clear, be sure you assess your dashboard and check telemetry data to ensure a smooth flight, and finally, let the following pages guide you

Lindsay Smith
DZone Publications
Meet the DZone Publications team! Publishing Refcards and Trend Reports year-round, this team can often be found reviewing and editing contributor pieces, working with authors and sponsors, and coordinating with designers. Part of their everyday includes collaborating across DZone's Production team to deliver high-quality content to the DZone community.

DZone Mission Statement

At DZone, we foster a collaborative environment that empowers developers and tech professionals to share knowledge, build skills, and solve problems through content, code, and community. We thoughtfully — and with intention — challenge the status quo and value diverse perspectives so that, as one, we can inspire positive change through technology.
Caitlin works with her team to develop and execute a vision for DZone's content strategy as it pertains to DZone Publications, Content, and Community. For Publications, Caitlin oversees the creation and publication of all DZone Trend Reports and Refcards. She helps with topic selection and outline creation to ensure that the publications released are highly curated and appeal to our developer audience. Outside of DZone, Caitlin enjoys running, DIYing, living near the beach, and exploring new restaurants near her home.

Lauren identifies and implements areas of improvement when it comes to authorship, article quality, content coverage, and sponsored content on DZone.com. She also oversees our team of contract editors, which includes recruiting, training, managing, and fostering an efficient and collaborative work environment. When not working, Lauren enjoys playing with her cats, Stella and Louie, reading, and playing video games.
Melissa Habit
Senior Publications Manager at DZone
@dzone_melissah on DZone
@melissahabit on LinkedIn

Melissa leads the publication lifecycles of Trend Reports and Refcards — from overseeing workflows, research, and design to collaborating with authors on content creation and reviews. Focused on overall Publications operations and branding, she works cross-functionally to help foster an engaging learning experience for DZone readers. At home, Melissa passes the days reading, knitting, and adoring her cats, Bean and Whitney.

Lucy Marcum
Publications Coordinator at DZone
@LucyMarcum on DZone
@lucy-marcum on LinkedIn

Lucy manages the Trend Report author experience, from sourcing new contributors to editing their articles for publication, and she creates different Trend Report components such as the Diving Deeper and the Solutions Directory. In addition, she assists with the author sourcing for and editing of Refcards. Outside of work, Lucy spends her time reading, writing, running, and trying to keep her cat, Olive, out of trouble.
Lindsay Smith
Senior Publications Manager at DZone
@DZone_LindsayS on DZone
@lindsaynicolesmith on LinkedIn
In September and October of 2022, DZone surveyed software developers, architects, and other IT professionals in order to
understand how applications are being designed, released, and tuned for performance. We also sought to explore trends from
data gathered in previous Trend Report surveys, most notably for our 2021 report on application performance management.
This year, we added new questions to our survey in order to gain more of an understanding around monitoring and
observability in system and web application performance.
Methods: We created a survey and distributed it to a global audience of software professionals. Question formats included
multiple choice, free response, and ranking. Survey links were distributed via email to an opt-in subscriber list, popups on
DZone.com, the DZone Core Slack workspace, and various DZone social media channels. The survey was open from September
13–October 3, 2022 and recorded 292 full and partial responses.
In this report, we review some of our key research findings. Many secondary findings of interest are not included here.
Research Target One: Understanding the Root Causes of Common Performance Issues
and Their Solutions
Motivations:
1. Discovering the causes and correlations around the most prevalent system and web application performance
issues and general developer attitudes toward common problems
2. Understanding how AI helps developers and site reliability engineers (SREs) discover and understand web
performance degradation
A major factor in having the ability to understand the root cause of performance issues is having the right tools in place to do
so (or any tools at all). Only around two in five respondents (39.4%) said their organization has implemented observability
and can quickly establish root cause and understand business impact. Without the organization supporting a DevOps team's
efforts to optimize performance with the tools and technologies to help the team discover why issues occur, many developers
are left frustrated with the lack of answers they're able to get — and the issues are that much more likely to happen again.
Of course, most developers do have an idea about what the underlying causes of their web performance issues are to some
extent. To discover common causes and correlations around the most prevalent web performance issues, we asked the
following question:
How often have you encountered the following root causes of web performance degradation?
Figure 1 (share of respondents answering often / sometimes / rarely / never):

Excessive algorithmic complexity due to bad code: 23.0% / 44.4% / 28.2% / 4.4%
Slow disk I/O due to bad code or configuration: 20.6% / 33.5% / 41.5% / 4.4%
Too many disk I/O operations due to bad code or configuration: 17.4% / 46.2% / 32.4% / 4.0%
Observations:
1. CPU thrashing is the most common reason for declines in web performance, with 26.4% of respondents saying they've
encountered this as a root cause of web performance degradation. In 2021, however, only 15.7% said CPU thrashing
is often the root cause. Last year, respondents said that they encountered high CPU load, memory exhaustion, I/O or
network bottlenecks, or excessive disk I/O operations or algorithmic complexity due to bad code all more often than CPU
thrashing. Last year, 35.3% of respondents said CPU thrashing is rarely the cause of their web performance degradation;
this year, that number was just 15.5%.
Why such a dramatic shift? Some developers may be misusing the resources at their disposal, although
through no fault of their own. CPU thrashing can indicate a misuse of resources since the computer's memory becomes
overwhelmed. As users and developers move forward into a world where online performance has spiked in importance
due to the growth of digital, it will only become more crucial for servers to be able to handle a high level of complexity.
2. Bad code is another common root cause of web performance degradation. 67.4% of respondents said that excessive
algorithmic complexity due to bad code is often or sometimes the root cause of web performance degradation,
compared to just 61.2% last year.
As algorithms become more complicated, ensuring that your code is accurate and optimized will be critical to
maintaining optimal web performance. Last year, just 12.2% of developers said that slow disk I/O due to bad code
or configuration was often the root cause of their web performance degradation. This year, that number has
grown to 20.6%.
We asked:

Table 1

Slow disk read/write  7  806  148
Slow GPU  8  688  132
Slow I/O  9  662  138
Other  10  231  113

that teams may not have been able to glean without it, at least not without spending a significant number of hours doing so. As the capabilities of artificial intelligence continue to expand, entirely new ways of understanding
As we'll explore in a later section, different organizations operate at varied levels of observability maturity in terms of being
able to understand why performance issues occur. However, most organizations that leverage AI to assist their performance
optimization efforts aren't proactively using the capabilities. In fact, only 3.5% of developers said their organization leverages
AIOps to proactively prevent issues.
We wanted to understand what the organizations that do utilize AI for performance are using it for. Specifically, we wanted to
know how artificial intelligence is being leveraged for highly accurate monitoring and observability purposes. So we asked:
In what ways is your organization adopting AI for monitoring and observability? Select all that apply.
Results:
Figure 2
(Chart response options: anomaly detection, causality determination, correlation and contextualization, historical analysis, performance analysis, and my organization doesn't use AI for monitoring and observability)
2. It's one thing to have data to analyze — it's another to know how to understand it. Most developers use AI after the fact
as a tool to analyze issues and determine why they occurred. That being said, it comes as no surprise that correlation and
contextualization make up the top way that developers use AI for monitoring and observability (37.2%), followed closely by
causality determination (36.8%).
3. Historical analysis, which can be an incredibly valuable tool to proactively work against performance issues (saving
organizations a lot of time, headaches, and money), comes not far behind: 34.5% said that their organization leverages AI
for historical analysis in relation to monitoring and observability, followed by performance analysis (32.2%).
2. Exploring the tools and performance patterns that software professionals successfully implement for monitoring and
observability
3. Comprehending the most common solutions that developers and SREs implement for web and application performance
How would you categorize your organization's state of observability maturity? We have implemented...
Results:
Table 2
We have implemented... % n=
Observability, and we can quickly establish root cause and understand business impact 39.4% 102
Observability, and we use AIOps capabilities to proactively prevent occurrences of issues 3.5% 9
Observations:
1. 73.8% of respondents said that they have implemented observability, while 25.1% are still in the monitoring stage.
Observability maturity will be crucial to organizational growth as the prevalence of performance issues continues to rise.
In another question, we asked developers whether their organization currently utilizes any observability tools. While
nearly three in four respondents said that they have implemented observability, just 51% said that their organization
does use observability tools. According to our research, 32% of organizations are currently in the observability tool
consideration stage.
In contrast to a world where developers must spend hours and hours diving into a performance issue to identify the root cause and understand why a component isn't working, implementing an observability tool can help teams quickly tackle these tasks and prevent future occurrences, instead of continuing to experience the same issue without a known cause.
3. As far as the specific tools that organizations use, of the respondents who plan to or currently leverage observability
tools, the majority (81.7%) use as-a-Service tools as opposed to self-managed. Most run observability tools as a service
in a private cloud (31.0%), followed by on-premises (26.8%) and in a public cloud (23.9%). We see similar trends among
respondents whose tools are self-managed, with the majority of these users running their tool on-premises (11.7%)
followed by private and public clouds, each at 3.3%.
In order to understand what functionality is most important to developers in choosing an observability tool, we asked:
When it comes to selecting observability tools, rank the following capabilities from most important (top) to least
important (bottom).
Results:
Table 3
Other 7 118 95
Observations:
1. When it comes to observability tool adoption, it's crucial for developers that the tool can integrate with existing tools
and/or their current tech stack, which ranked as the number one capability overall.
2. According to our research, the tool's cost is the second-to-least important consideration, revealing that developers are
willing to pay a price to get the right tool. Other top-ranked capabilities include maturity of features and complexity of use.
Good things usually don't come easily, though. Developers face their fair share of challenges in adopting observability practices.
And in order to further understand the full scope of these challenges in relation to tool adoption, we also asked:
In your opinion, what are your organization's greatest challenges in adopting observability projects, tools, and practices?
Select all that apply.
Figure 3
(Chart response options: complexity, lack of knowledge/skills, lack of leadership, limited resources, no clear strategy, organization and the creation of team silos, teams using multiple tools, cost, and other (write-in))
Observations:
1. 43.4% of respondents named a lack of knowledge and skills as the greatest challenge for their organization in adopting
observability projects, tools, and practices. 39.1% said that they have limited resources to implement observability, and
34.9% said that a lack of leadership is a challenge.
2. 22.1% reported that teams using multiple tools is a challenge their organization faces when implementing observability.
Cross-departmental alignment — and even cross-dev team alignment — remains a key sticking point for many
organizations. This can be an issue not just for gaining a comprehensive understanding of a system's state but also for
organizational efficiency and avoiding double-dipping into budgets. More cross-team alignment about the tools being
used makes things easier for everyone. The lack of leadership that 34.9% of respondents cited as a challenge in their organization's adoption of observability practices could also play a part here.
3. According to another survey question that we asked, only 3.9% of respondents said that their organization doesn't
conduct observability work. The building and testing stages are the most common parts of the software development
lifecycle (SDLC) where developers conduct observability work (52.7% and 54.3%, respectively). At all other stages of the
SDLC, respondents reported conducting observability work during planning (30.6%), design (42.6%), deployment (35.3%),
and maintenance (26.7%).
For developers, observability can provide a valuable sense of relief. 45.5% of respondents said that observability helps them detect
hard-to-catch problems, 38.7% said it enables more reliable development, and 36% said it builds confidence. The benefits aren't
just emotional — nearly half (49.4%) reported that observability increases automation, 37.5% said it leads to more innovation, and
21.3% said it reduces cost. These benefits strongly outweigh the downsides of lackluster organizational alignment.
A community that understands the field of web performance and observability can be a great resource for developers. A Center
of Excellence (CoE) for performance and reliability can provide shared resources and highly valuable, exclusive insights from
developers who are subject matter experts in the field. 60.7% of respondents say their organization has a Center of Excellence
for performance and reliability.
Definitions:
• Express Train – for some tasks, create an alternate path that does only the minimal required work (e.g., for data
fetches requiring maximum performance, create multiple DAOs — some enriched, some impoverished)
• Hard Sequence – enforce sequential completion of high-priority tasks, even if multiple threads are available (e.g., chain
Ajax calls enabling optional interactions only after minimal page load, even if later calls do not physically depend on
earlier returns)
• Batching – chunk similar tasks together to avoid spin-up/spin-down overhead (e.g., create a service that monitors a queue of vectorizable tasks and groups them into one vectorized process once some threshold count is reached; see the sketch after this list)
• Interface Matching – for tasks commonly accessed through a particular interface, create a coarse-grained object that
combines anything required for tasks defined by the interface (e.g., for an e-commerce cart, create a CartDecorated
object that handles all cart-related calculations and initializes with all data required for these calculations)
• Copy-Merge – for tasks using the same physical resource, distribute multiple physical copies of the resource and
reconcile in a separate process if needed (e.g., database sharding)
• Calendar Control – when workload timing can be predicted, block predicted work times from schedulers so that
simultaneous demand does not exceed available resources
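The pattern definitions above are easiest to see in code. Below is a minimal, hypothetical Python sketch of the batching pattern: a background worker watches a task queue and flushes tasks in one grouped call once a threshold is reached. The queue, threshold, and `process_batch` function are illustrative assumptions, not part of the survey.

```python
import queue
import threading
import time

BATCH_THRESHOLD = 32          # illustrative: flush once this many tasks are queued
FLUSH_INTERVAL_SECONDS = 0.5  # also flush periodically so stragglers are not starved

task_queue: "queue.Queue[dict]" = queue.Queue()

def process_batch(tasks: list) -> None:
    # Stand-in for one vectorized operation (e.g., a single bulk insert).
    print(f"processing {len(tasks)} tasks in one batch")

def batcher() -> None:
    """Monitor the queue and group similar tasks into one batched call."""
    pending: list = []
    last_flush = time.monotonic()
    while True:
        try:
            pending.append(task_queue.get(timeout=FLUSH_INTERVAL_SECONDS))
        except queue.Empty:
            pass
        threshold_hit = len(pending) >= BATCH_THRESHOLD
        interval_hit = pending and time.monotonic() - last_flush >= FLUSH_INTERVAL_SECONDS
        if threshold_hit or interval_hit:
            process_batch(pending)
            pending = []
            last_flush = time.monotonic()

# Usage: enqueue tasks; the batcher groups them to avoid per-task spin-up overhead.
threading.Thread(target=batcher, daemon=True).start()
for i in range(100):
    task_queue.put({"id": i})
time.sleep(2)
```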
Results, 2022:
Figure 4
Results, comparing 2022 vs. 2021 for respondents who implemented every performance pattern:
Table 4
2. Batching is still the most common software performance pattern, with 98.1% of respondents saying they implement it.
However, just 62.4% of respondents said they implement batching often or sometimes compared to 78% saying they
implement express train often or sometimes. In fact, in 2022, 35.7% of respondents said they rarely implement batching
compared to express train at 17.1%. Last year, 75.1% of respondents said they implement batching often or sometimes, with
just 18.1% saying they did so rarely.
The nearly twofold increase in respondents saying they rarely implement batching may be because batching is one
of the broadest and least complex software performance patterns. It is similar to basic principles like data locality and
is one of the first patterns that most developers learn. Express train, on the other hand, involves creating an alternate
path that does only the minimal required work to achieve maximum performance.
3. The performance pattern that saw the largest increase in overall usage is hard sequence, which enforces the sequential
completion of high-priority tasks, even if multiple threads are available. The significant increase in this could be due
to hard sequence being a more efficient option. Hard sequence allows SREs to choose to force a process to happen
sequentially rather than submit tasks concurrently and let them complete in whatever order they finish on the server.
2. Learning how developers and SREs measure and manage performance through service-level objectives (SLOs)
APPROACHES TO SELF-HEALING
Self-healing is important in modern software development for three primary reasons:
1. Even simple programs written in modern high-level languages can quickly explode in complexity beyond the
human ability to understand execution (or else formal verification would be trivial for most programs, which is
far from the case). This means that reliably doing computational work requires robustness against unimagined
execution sequences.
2. Core multiplication, branch prediction (and related machine-level optimizations), and horizontal worker scaling — often
over the internet — have borne the brunt of the modern extension of Moore's Law. This means that communication
channels have high cardinality, grow rapidly, may not be deterministic, and are likely to encounter unpredictable
partitioning. A lot of low-level work is handled invisibly to the programmer, which means that modern programs must
be able to respond increasingly well to situations that were not considered in the application's own formal design.
3. Mobile and IoT devices are increasingly dominant platforms — and they are constantly losing radio connections to the
internet. As demands for high bandwidth require higher-frequency radio signals, network partitioning worsens and
becomes harder to hand-wave away.
We wanted to know what techniques software professionals are using for automatic recovery from application failure, so
we asked:
Which of the following approaches to self-healing have you implemented? Select all that apply.
Results:
(Chart response options: backoff, chaos engineering, circuit breaker, client throttling, compensating transactions, critical resource isolation, database snapshotting, failover, graceful degradation, leader election, long-running transaction checkpoints, queue-based load leveling, retry, and none)
Observations:
1. In 2021, the top self-healing approach implemented by developers was retry, coming in at 60%. However, this year, that
number has shrunk tremendously to just 12.3%. Instead, the number one self-healing approach that developers have
implemented is client throttling (35.4%), followed closely by database snapshotting (35.0%).
2. Additionally, in 2021, failover was named as a primary self-healing approach by 53.8% of developers and circuit breaker was named by 46.5%. The main takeaway is that this year, results are much more spread out among approaches. Circuit breaker,
client throttling, critical resource isolation, database snapshotting, and failover were all ranked between 30% and 35.4% and
make up the top five responses.
As expectations grow for developers' ability to create programs that never fail (even without the full context of what
the program is supposed to do), methods for self-healing are all only growing in importance and prevalence, and it's
clear that developers must be open to trying new approaches to self-healing.
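Retry and backoff, two of the approaches charted above, are commonly combined into a single self-healing policy. The following is a minimal, library-free Python sketch under that assumption; `flaky_call` and the retry limits are invented for illustration.

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.2, max_delay=5.0):
    """Retry a failing operation, doubling the wait each attempt and adding jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # give up and let the failure surface after the final attempt
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(delay + random.uniform(0, delay / 2))  # jitter avoids synchronized retries

def flaky_call():
    # Hypothetical remote call that fails transiently about half the time.
    if random.random() < 0.5:
        raise ConnectionError("transient failure")
    return "ok"

print(retry_with_backoff(flaky_call))
```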
Which of the following metrics does your organization use to measure performance? Select all that apply.
Results:
(Chart response options: any garbage collection metric, Apdex score, average database query response time, average server response time, concurrent users, CPU usage, error rate, longest running processes, number of autoscaled application instances, request rate, uptime, user satisfaction, and other (write-in))
Observation: CPU usage is the most common measurement method, with 43.1% of respondents saying their organization
measures performance this way, followed by average server response time at 40.1%. 37.8% said they measure performance
based on average database query response time, and 36.3% said they do so based on the number of concurrent users. Just
14.2% of respondents said that their organization measures performance based on user satisfaction, and 24.7% based on error
rate. Monitoring is a critical way that developers identify issues, namely infrastructure monitoring, log monitoring, and end-
user monitoring.
Service-level objectives are agreements made, for example, by a DevOps team with a business customer about specific
uptimes or response times that are expected to be maintained. SLOs can cause strife for many developers because they can
be vague, unrealistic, or difficult to measure. SLOs can be helpful, however, when they are tied directly to the business goal. To
learn how developers are defining success, we first asked:
Results:
Yes: 33.1%
No: 57.3%
I don't know: 9.6%
Observation: The number of respondents who reported that their organization does not use SLOs has grown significantly
over the past year. In fact, 38.7% of respondents to our 2021 survey said that their organization does not have SLOs, compared to 57.3%
this year. Also significant to note is that last year, 31.5% of respondents said that they didn't know whether their organization
had SLOs. Last year, results were nearly split evenly between respondents saying their organization does use (29.8%), their
organization doesn't use (38.7%), or they don't know if their organization uses SLOs (31.5%). This indicated that perhaps SLOs
didn't contribute to these respondents' decision-making.
However, this year, many of the respondents who previously indicated that they didn't know if their organization implemented
SLOs moved into the "does not use" bucket. This year, only 9.6% of respondents said that they don't know if their organization
has SLOs. This suggests that more developers may be aware of these performance tracking metrics, but perhaps the metrics
haven't proved to be as fruitful as they could be and, thus, aren't worth spending time on.
To understand exactly how success is measured according to organizations' SLOs, we then asked a free-response question:
"What SLOs does your organization have?" For many respondents, mean time to resolution (MTTR) was one of the top tracked
objectives. This is also the primary metric that organizations look at in order to define success. SLOs can often be project-
specific (as many of our respondents also noted), but it's refreshing to see the same metric remaining highly important across both clients and organizations.
Further, we wanted to understand not only the metrics that are important to customers but also the metrics that are
important to the organization providing the service. So we asked:
What metrics does your organization use to define success? This can be at the individual or team level. Select all that apply.
Observation: MTTR is by far the number one metric that organizations use to define success, with 64% of respondents saying their organization measures it. From the end user's perspective, the most important factor aside from completely avoiding performance issues is to resolve them quickly. Mean time between failures falls second (40.7%), nearly tied with number of incidents (40.3%). In other words, failure frequency in general is a top metric that organizations track to determine success. Frequency of deploys is still a top tracked metric, with 34.5% of respondents saying their organization uses this metric to define success. Organizations should be careful not to rely too heavily on this metric, as quantity shouldn't be favored over quality.

METRICS USED TO MEASURE SUCCESS

Frequency of deploys: 34.5% (n=89)
Number of incidents: 40.3% (n=104)
Mean time to resolve: 64.0% (n=165)
Mean time between failure: 40.7% (n=105)
Revenue: 20.2% (n=52)
We don't use any metrics: 5.4% (n=14)
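To make the two failure-related metrics in the table concrete, the sketch below computes mean time to resolve and mean time between failures from a hypothetical incident log; the timestamps are invented.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (start of outage, time service was restored).
incidents = [
    (datetime(2022, 9, 1, 10, 0), datetime(2022, 9, 1, 11, 30)),
    (datetime(2022, 9, 8, 2, 15), datetime(2022, 9, 8, 2, 45)),
    (datetime(2022, 9, 20, 16, 0), datetime(2022, 9, 20, 18, 0)),
]

# MTTR: average duration from failure to resolution.
mttr = sum((end - start for start, end in incidents), timedelta()) / len(incidents)

# MTBF: average gap between the start of one failure and the start of the next.
gaps = [later[0] - earlier[0] for earlier, later in zip(incidents, incidents[1:])]
mtbf = sum(gaps, timedelta()) / len(gaps)

print(f"MTTR: {mttr}, MTBF: {mtbf}")
```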
We have begun to address this topic in our research but should note that this survey included questions that we did not have
room to analyze here, including:
• The factors that developers most often blame for poor performance
• Self-managed vs. as-a-Service vs. cloud tool management
• How observability fits into the different stages of the software development lifecycle
We intend to analyze this data in future publications. Please contact [email protected] if you would like to discuss any
of our findings or supplementary data.
Sarah is a writer, researcher, and marketer who regularly analyzes survey data to provide valuable, easily
digestible insights to curious readers. She currently works in B2B content marketing but previously
worked for DZone for three years as an editor and publications manager. Outside of work, Sarah enjoys
being around nature, reading books, going to yoga classes, and spending time with her cat, Charlie.
Performance Engineering
Powered by Machine Learning
By Joana Carvalho, Performance Engineer at Postman
Software testing is straightforward — every input => known output. However, historically, a great deal of testing has been
guesswork. We create user journeys, estimate load and think time, run tests, and compare the current result with the baseline.
If we don't spot regressions, the build gets a thumbs up, and we move on. If there is a regression, back it goes. Most times,
we already know the output even though it needs to be better defined — less ambiguous with clear boundaries of where a
regression falls. Here is where machine learning (ML) systems and predictive analytics enter: to end ambiguity.
After tests finish, performance engineers do more than look at the result averages and means; they will look at percentiles. For example, 10 percent of the slowest requests are caused by a system bug that creates a condition that always impacts speed.

Figure 1: Overall confidence in performance metrics
Identify root causes. You can focus on other areas needing attention using machine learning techniques to identify root
causes for availability or performance problems. Predictive analytics can then analyze each cluster's various features, providing
insights into the changes we need to make to reach the ideal performance and avoid bottlenecks.
Monitor application health. Performing real-time application monitoring using machine-learning techniques allows
organizations to catch and respond to degradation promptly. Most applications rely on multiple services to get the complete
application's status; predictive analytics models will correlate and analyze the data when the application is healthy to identify
whether incoming data is an outlier.
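As a rough illustration of the health-monitoring idea above (learn what "healthy" looks like, then flag incoming data that falls outside it), here is a small sketch using a plain z-score rather than any particular vendor's model; the latency samples are invented.

```python
import statistics

# Latency samples (ms) collected while the application was known to be healthy (invented data).
healthy_latencies = [102, 98, 110, 105, 99, 101, 97, 108, 103, 100]

mean = statistics.mean(healthy_latencies)
stdev = statistics.stdev(healthy_latencies)

def is_outlier(latency_ms: float, threshold: float = 3.0) -> bool:
    """Flag a measurement more than `threshold` standard deviations from the healthy mean."""
    return abs(latency_ms - mean) / stdev > threshold

for sample in (104, 131, 240):
    print(sample, "outlier" if is_outlier(sample) else "normal")
```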
Predict user load. We have relied on peak user traffic to size our infrastructure for the number of users accessing the
application in the future. This approach has limitations as it does not consider changes or other unknown factors. Predictive
analytics can help indicate the user load and better prepare to handle it, helping teams plan their infrastructure requirements
and capacity utilization.
Stop looking at thresholds and start analyzing data. Observability and monitoring generate large amounts of data that can
take up to several hundred megabytes a week. Even with modern analytic tools, you must know what you're looking for in
advance. This leads to teams not looking directly at the data but instead setting thresholds as triggers for action. Even mature
teams look for exceptions rather than diving into their data. To mitigate this, we integrate models with the available data
sources. The models will then sift through the data and calculate the thresholds over time. Using this technique, where models
are fed and aggregate historical data, provides thresholds based on seasonality rather than set by humans. Algorithm-set
thresholds trigger fewer alerts; however, these are far more actionable and valuable.
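One way to read the paragraph above is that thresholds should come from the data's own seasonality rather than from a human guess. The sketch below derives a per-hour-of-day threshold from invented historical latency data; the distributions, window, and percentile are assumptions for illustration only.

```python
import random
from collections import defaultdict
from datetime import datetime

# Invented history: (timestamp, latency in ms); a real system would feed weeks of telemetry.
random.seed(7)
history = [
    (datetime(2022, 9, day, hour), random.gauss(140 if 9 <= hour <= 17 else 100, 10))
    for day in range(1, 29) for hour in range(24)
]

# Group latencies by hour of day to capture daily seasonality.
by_hour = defaultdict(list)
for ts, latency in history:
    by_hour[ts.hour].append(latency)

def percentile(values, pct):
    ordered = sorted(values)
    return ordered[int(pct / 100 * (len(ordered) - 1))]

# Threshold per hour = 99th percentile of what that hour normally looks like.
thresholds = {hour: percentile(vals, 99) for hour, vals in by_hour.items()}

def breaches_threshold(ts: datetime, latency: float) -> bool:
    return latency > thresholds[ts.hour]

print(breaches_threshold(datetime(2022, 9, 29, 14), 150.0))  # business hours: likely normal
print(breaches_threshold(datetime(2022, 9, 29, 3), 150.0))   # middle of the night: likely an alert
```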
Analyze and correlate across datasets. Your data is mostly time series, making it easier to look at a single variable over time.
Many trends come from the interactions of multiple measures. For example, response time may drop only when various
transactions are made simultaneously with the same target. For a human, that's almost impossible, but properly trained
algorithms will spot these correlations.
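To make the cross-dataset point concrete, the short sketch below correlates two invented, time-aligned series (concurrent transactions against one target and response time) with a plain Pearson coefficient; a trained model would do this across many more measures at once.

```python
import statistics

# Two invented series sampled once per minute over the same window.
concurrent_txns = [5, 8, 12, 20, 35, 50, 48, 30, 15, 7]
response_ms = [110, 112, 118, 135, 180, 240, 235, 170, 130, 115]

def pearson(xs, ys):
    """Pearson correlation: covariance normalized by the two standard deviations."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# A coefficient near 1.0 suggests response time degrades as simultaneous transactions rise.
print(f"correlation: {pearson(concurrent_txns, response_ms):.2f}")
```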
For example, each project must use the same ranking system, so if one project uses 1 as critical and another uses 5 — like when
people use DEFCON 5 when they mean DEFCON 1 — the values must be normalized before processing. Predictive algorithms
are composed of the algorithm and the data it's fed, and software development generates immense amounts of data that,
until recently, sat idle, waiting to be deleted. However, predictive analytics algorithms can process those files, surface patterns we can't detect ourselves, and help us ask and answer questions based on that data, such as:
• Are we wasting time testing scenarios that aren't used?
• How do performance improvements correlate with user happiness?
• How long will it take to fix a specific defect?
These questions and their answers are what predictive analytics is used for — to better understand what is likely to happen.
THE ALGORITHMS
The other main component in predictive analysis is the algorithm; you'll want to select or implement it carefully. Starting
simple is vital as models tend to grow in complexity, becoming more sensitive to changes in the input data and distorting
predictions. They can solve two categories of problems: classification and regression (see Figure 2).
• Classification is used to forecast the result of a set by sorting it into categories, starting with inferring labels from the input data, such as "down" or "up."
• Regression is used to forecast the result of a set when the output variable is a set of real values. It will process input
data to predict, for example, the amount of memory used, the lines of code written by a developer, etc. The most used
prediction models are neural networks, decision trees, and linear and logistic regression.
DECISION TREES
Figure 3: Decision tree example
A decision tree is an analytics method that presents the
results in a series of if/then choices to forecast specific
options' potential risks and benefits. It can solve all
classification problems and answer complex issues.
Table 1

Linear regression:
• Used to define a value on a continuous range, such as the risk of user traffic peaks in the following months.
• It's expressed as y = a + bx, where x is an input set used to determine the output y. Coefficients a and b are used to quantify the relation between x and y, where a is the intercept and b is the slope of the line.
• The goal is to fit a line nearest to most points, reducing the distance or error between y and the line.

Logistic regression:
• It's a statistical method where the parameters are predicted based on older sets. It best suits binary classification: datasets where y = 0 or 1, where 1 represents the default class. Its name derives from its transformation function being a logistic function.
• It's expressed by the logistic function p(x) = 1 / (1 + e^(-(β0 + β1x))), where β0 is the intercept and β1 is the rate. It uses training data to calculate the coefficients, minimizing the error between the predicted and actual outcomes.
• It forms an S-shaped curve where a threshold is applied to transform the probability into a binary classification.
These are supervised learning methods, as the algorithm solves for a specific property. Unsupervised learning is used when you
don't have a particular outcome in mind but want to identify possible patterns or trends. In this case, the model will analyze as
many combinations of features as possible to find correlations from which humans can act.
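As a small, hypothetical illustration of the classification case described earlier, the sketch below fits a logistic regression that labels a service "up" (0) or "down" (1) from two invented features, CPU load and error rate. It assumes scikit-learn is available; the data and feature choice are not from the article.

```python
from sklearn.linear_model import LogisticRegression

# Invented training data: [cpu_load_percent, error_rate_percent] -> 0 = "up", 1 = "down".
X = [[20, 0.1], [35, 0.3], [50, 0.5], [70, 1.0], [85, 4.0], [92, 7.5], [97, 12.0], [99, 20.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Probability that a new observation corresponds to the "down" class.
new_sample = [[88, 5.0]]
print(model.predict(new_sample), model.predict_proba(new_sample))
```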
• Type of defect
• In what phase was the defect identified
• What the root cause of the defect is
• Whether the defect is reproducible
Once you understand this, you can make changes and create tests to prevent similar issues sooner.
Conclusion
Software engineers have made hundreds and thousands of assumptions since the dawn of programming. But digital users are
now more aware and have a lower tolerance for bugs and failures. Businesses are also competing to deliver a more engaging
and flawless user experience through tailored services and complex software that is becoming more difficult to test.
Today, everything needs to work seamlessly and support all popular browsers, mobile devices, and apps. A crash of even a few
minutes can cause a loss of thousands or millions of dollars. To prevent issues, teams must incorporate observability solutions
and user experience throughout the software lifecycle. Managing the quality and performance of complex systems requires
more than simply executing test cases and running load tests. Trends help you tell if a situation is under control, getting better,
or worsening — and how fast it improves or worsens. Machine learning techniques can help predict performance problems,
allowing teams to course correct. To quote Benjamin Franklin, "An ounce of prevention is worth a pound of cure."
Joana has been a performance engineer for the last 11 years, analyzing root causes from user interaction down to bare metal, tuning performance, and evaluating new technologies. Her goal is to create
solutions to empower the development teams to own performance investigation, visualization, and
reporting so that they can, in a self-sufficient manner, own the quality of their services. At Postman, she mainly implements
performance profiling, evaluation, analysis, and tuning.
Observability is trending, and for good reason. Modern observability solutions — focused on outcomes — drive innovation,
exceptional experience, and ultimately, competitive edge. Because engineers now need to focus more on making the systems
they build easier to observe, traditional monitoring software vendors (plus the technology industry analysts and influencers
advising them) have rushed to offer their takes on observability. It's not hard to understand why creating confusion is in their
interests. Don't be misled.
2. EPHEMERALITY
Tools from APM vendors weren't designed for dynamic environments. Yet containers are dynamic. There are so many of them
and containers may only live for a few minutes while VMs may exist for many months. Observability solutions maximize the
value of data in dynamic environments by providing flexibility and control of data for both short- and long-term use cases.
3. INTERDEPENDENCE
APM tools are good at handling potential, anticipated issues. Observability solutions are too, but they do so much more.
Relationships between apps and infrastructure are predictable for organizations running only monolithic apps and VMs.
Contrast those with relationships between microservices and containers in the cloud era that are much more fluid and
complex. With cloud environments, data cardinality is higher as well, making it much more challenging for teams to make
associations between applications, infrastructure, and business metrics. Observability connects the dots.
4. DATA FORMATS
Observability solutions ensure freedom of choice. APM tools lock users in because their appointed agents only ingest and store
data in proprietary formats that the vendors decide. And managing those silos inhibits collaboration while increasing costs.
With observability solutions, organizations get the compatibility with open-source standards and data ownership they want
and need. Teams can also share and access data across domains to better collaborate, which leads to faster detection and issue
resolution.
A Primer on Distributed
Systems Observability
By Boris Zaikin, Software & Cloud Architect at Nordcloud GmbH
In the past few years, the complexity of systems architectures drastically increased, especially in distributed, microservices-
based architectures. It is extremely hard and, in most cases, inefficient to debug and watch logs, particularly when we have
hundreds or even thousands of microservices or modules. In this article, I will describe what observability and monitoring
systems, the patterns of a good observability platform, and the observability subsystem may look like.
We know what functions an observability system should cover. Below we can see what information should be gathered to properly design an observability and monitoring platform.
• Metrics – Data collection allows us to understand the application and infrastructure states — for example, latency and the
usage of CPU, memory, and storage.
• Distributed traces – Allow us to investigate the event or issue flow from one service to another.
• Logs – This is a message with a timestamp that contains information about application- or service-level errors, exceptions,
and information.
• Alerting – When an outage occurs or something goes wrong with one or several services, alerts notify operators of these problems via email, SMS, chat, or phone call. This allows for quick action to fix the issue.
• Availability – Ensures that all services are up and running. The monitoring platform sends probe messages to some service
or component (to the HTTP API endpoint) to check if it responds. If not, then the observability system generates an alert
(see the bullet point for alerting).
Also, some observability and monitoring platforms may include user experience monitoring, such as heat maps and user
action recording.
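As one way to picture the metrics and availability signals listed above, here is a small sketch using the Python prometheus_client library; the metric names, port, and simulated work are arbitrary examples rather than a prescribed setup.

```python
import random
import time

from prometheus_client import Gauge, Histogram, start_http_server

# Example metrics an observability platform could scrape from this service.
REQUEST_LATENCY = Histogram("app_request_latency_seconds", "Latency of handled requests")
SERVICE_UP = Gauge("app_service_up", "1 if the service considers itself healthy, 0 otherwise")

def handle_request() -> None:
    with REQUEST_LATENCY.time():               # records how long the block takes
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics on port 8000 for scraping
    SERVICE_UP.set(1)        # a separate availability probe could also hit an HTTP health endpoint
    while True:
        handle_request()
```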
Observability and monitoring follow the same principles and patterns and rely primarily on toolsets, so in my opinion, the
differentiation between the two is made for marketing purposes. There is no clear definition of how observability differs from
monitoring; all definitions are different and high-level.
Observability Patterns
All complex systems based on microservices have recommendations and patterns. This allows us to build a reliable system
without reinventing the wheel. Observability systems also have some essential patterns. The following sections discuss five of
the most important patterns.
To do this, you need to have some distributed system that will collect
and analyze all tracing data. Some open-source services allow you to
do so, such as Jaeger, OpenTelemetry, and OpenCensus. Check out
the Istio documentation for an example that demonstrates distributed
tracing in action.
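For a sense of what such instrumentation looks like, here is a minimal sketch that emits a parent and child span with the OpenTelemetry Python SDK, printing finished spans to the console instead of shipping them to a backend such as Jaeger; the service and span names are invented.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Wire up a tracer that prints finished spans to stdout (a real setup would export to Jaeger/OTLP).
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-service")  # invented service name

def handle_order() -> None:
    with tracer.start_as_current_span("handle_order"):        # parent span for the request
        with tracer.start_as_current_span("query_database"):  # child span for a downstream call
            pass  # stand-in for the actual database query

handle_order()
```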
• CPU
• Memory
• Disk use
• Service request/response time
• Latency
Below is an example of the service that has a proxy agent. The proxy agent aggregates and sends telemetry data to the
observability platform.
In addition, the system can do the following actions to help the owner:
• Turn on/off the heating when people are about to arrive at the apartment.
• Notify, alert, or just ask if something requires human attention or if something is wrong.
In Figure 6, you can see an architecture that is based on the microservices pattern, since it serves best and represents all
system components. It contains main and observability subsystems. Each microservice is based on Azure Functions and
deployed to the Azure Kubernetes Cluster. We deploy functions to Kubernetes using the KEDA framework. KEDA is an open-source, Kubernetes-based, event-driven autoscaler that allows us to automatically deploy and scale our microservices functions. Also, KEDA provides the tools to wrap functions in Docker containers. We can also deploy microservices functions directly
without KEDA and Kubernetes if we don't have a massive load and don't need the scaling options. The architecture contains
the following components that represent the main subsystem:
The essential part here is an observability subsystem. A variety of components and tools represent it. I've described all
components in Table 1 below:
Table 1
Tool Description
Prometheus Prometheus is an open-source framework to collect and store logs and telemetry as time series data. Also, it provides
alerting logic. Prometheus proxy or sidecar integrates with each microservice to collect all logs, telemetry, and tracing data.
Grafana Loki Grafana Loki is an open-source distributed log aggregation service. It's based on a labeling algorithm. It's not indexing
the logs; rather, it's assigning labels to each log domain, subsystem, or category.
Jaeger Jaeger is an open-source framework for distributed tracing in microservices-based systems. It also provides search and
data visualization options. Some of the high-level use cases of Jaeger include:
1. Performance and latency optimization
2. Distributed transaction monitoring
3. Service dependency analysis
4. Distributed context propagation
5. Root cause analysis
Grafana (Azure Managed Grafana)   Grafana is also an open-source data visualization and analytics system. It allows the collection of traces, logs, and other telemetry data from different sources. We are using Grafana as a primary UI "control plane" to build and visualize data dashboards that will come from Prometheus, Loki, and Grafana Loki sources.
We can also use the OpenTelemetry (OTel) framework. OTel is an open-source framework that was created, developed, and supported by the Cloud Native Computing Foundation (CNCF). The idea is to create a standardized, vendor-neutral observability language specification, API, and toolset. It is intended to collect, transform, and export telemetry data. Our architecture is based
on the Azure cloud, and we can enable OpenTelemetry for our infrastructure and application components. Below you can see
how our architecture can change with OpenTelemetry.
It is also worth mentioning that we do not necessarily need to add OTel, as it may add additional complexity to the system. In
the figure above, you can see that we need to forward all logs from Prometheus to OTel. Also, we can use Jaeger as a backend
service for OTel. Grafana Loki and Grafana will get data from OTel.
Conclusion
In this article, we demystified observability and monitoring terms, and we walked through examples of microservices
architecture with observability subsystems that can be used not only with Azure but also with other cloud providers. Also,
we defined the main difference between monitoring and observability, and we walked through essential monitoring and
observability patterns and toolsets. Developers and architects should understand that an observability/monitoring platform is tooling or a technical solution that allows teams to actively debug their system.
I'm a certified senior software and cloud architect who has solid experience designing and developing
complex solutions based on the Azure, Google, and AWS clouds. I have expertise in building distributed
systems and frameworks based on Kubernetes and Azure Service Fabric. My areas of interest include
enterprise cloud solutions, edge computing, high load applications, multitenant distributed systems, and IoT solutions.
Challenge

Banks rely on high availability to provide customers 24x7 access to their funds. With an uptime of 97.57 percent and a 4-6 hour mean time to repair (MTTR) per incident, Nationale-Nederlanden Bank (NN Bank) was off its SLO — negatively impacting customer experience and customer NPS.

The problem was it took too long to identify the root cause of outages. NN Bank had 20+ IT teams, using several monitoring solutions. Data from these systems was forwarded to a central data lake which did not correlate the data, nor show how all the systems and their components were interrelated. When an issue occurred, root cause analysis (RCA) was challenging.

Given NN Bank's dynamic, hybrid environment, it was clear that the IT teams needed a way to quickly integrate massive amounts of data from siloed systems, correlate it, and get a unified view of the overall IT environment.

Solution

StackState auto-discovered the bank's IT environment, generating a visual topology and mapping dependencies between components. StackState also tracks changes over time, in real-time. The benefits of this functionality were substantive. For NN Bank, auto-discovery of the IT environment was key. Teams were able to see the full IT stack, which enabled them to determine RCA more quickly.

By implementing the StackState platform, NN Bank was able to pinpoint where events were occurring and instantly visualize upstream and downstream dependencies. As a result, the bank achieved targeted insights to help focus remediation efforts. The bank's IT teams eliminated time-consuming discussions around what was happening, where, what caused it, and the systems impacted. Instead, everyone had a single view into where a problem was occurring and how it impacted connected systems.

COMPANY: Nationale-Nederlanden Bank
COMPANY SIZE: 10,000+ employees
INDUSTRY: Banking/Financial Services
PRODUCTS USED: StackState, Splunk, SolarWinds, Prometheus, AWS
PRIMARY OUTCOME: By implementing observability, NN Bank sped up the identification of root cause, decreased MTTR, and — most importantly — increased customer satisfaction.

"If you have ever had to do a root cause analysis, you know it's not so easy. Management often tells you it should be and asks, 'Why are you taking such a long time?' But it's really difficult."
— Scrum Master, Platform Services, NN Bank
Distributed tracing, as the name suggests, is a method of tracking requests as they flow through distributed applications. Along
with logs and metrics, distributed tracing makes up the three pillars of observability. While all three signals are important to
determine the health of the overall system, distributed tracing has seen significant growth and adoption in recent years.
That's because traces are a powerful diagnostic tool for painting how requests propagate between services and uncovering issues
along the boundaries. As the number of microservices grows, the complexity in observing the entire lifespan of requests
inevitably increases as well. Logs and metrics can certainly help with debugging issues stemming from a single service, but
distributed tracing will tie contextual information from all the services and surface the underlying issue.
Instrumenting for observability is an ongoing challenge for any enterprise as the software landscape continues to evolve.
Fortunately, distributed tracing provides the visibility companies need to operate in a growing microservice ecosystem. In
this article, we'll dive deep into the components of distributed traces, reasons to use distributed tracing, considerations for
implementing it, as well as an overview of the popular tools in the market today.
• Spans – smallest unit of work captured in observing a request (e.g., API call, database query)
• Traces – a collection of one or more spans
• Tags – metadata associated with a span (e.g., userId, resourceName)
To illustrate, let's walk through a distributed tracing scenario for a system with a front end, simple web server, and a database.
Tracing begins when a request is initiated (e.g., clicking a button, submitting a form, etc.). This creates a new trace with a
unique ID and the top-level span. As the request propagates to a new service, a child span is created. In our example, this would
happen as the request hits the web server and when a query to the database is made. At each step, various metadata is also
logged and tied to the span as well as the top-level trace.
Once all the work is complete for the corresponding request, all of the spans are aggregated with associated tags to assemble
the trace. This provides a view of the system, following the lifecycle of a request. This aggregated data is usually presented as a
flame graph with nested spans over time.
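The walkthrough above can be made concrete with a toy data model (not any particular tracing library) showing how spans, a shared trace ID, and tags fit together; every name and tag here is invented.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    trace_id: str             # shared by every span in the same request
    name: str                 # e.g., "POST /orders" or "db.query"
    parent_id: Optional[str]  # None for the top-level span
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:8])
    tags: dict = field(default_factory=dict)
    start: float = field(default_factory=time.time)
    end: float = 0.0

# One trace for one user request, flowing front end -> web server -> database.
trace_id = uuid.uuid4().hex
root = Span(trace_id, "click: submit_form", None, tags={"userId": "u-123"})
server = Span(trace_id, "POST /orders", root.span_id, tags={"resourceName": "orders"})
db = Span(trace_id, "db.query", server.span_id, tags={"statement": "INSERT INTO orders"})

for span in (db, server, root):  # children complete before their parents
    span.end = time.time()

# Aggregating every span that shares the trace ID reconstructs the request's lifecycle,
# which a tracing backend would typically render as a flame graph of nested spans.
trace_view = [s for s in (root, server, db) if s.trace_id == trace_id]
print([(s.name, s.parent_id, s.tags) for s in trace_view])
```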
However, in a microservices world, problems can occur not just inside a single application (which logs and metrics can reveal),
but also at the boundaries of those services. To respond to an incident or to debug a performance degradation, it's important to
understand how the requests are flowing through one service to another.
• Visualizing service relationships – By inspecting the spans within a trace from the flame graph, developers can map out
all the service calls and their request flow. This helps to paint a global picture of the system, providing contextual data to
identify bottlenecks or ramifications from design changes.
• Pinpointing issues faster – When the engineer on-call is paged from an incident, traces can quickly surface the issue
and lead to reduced mean time to detect (MTTD) and repair (MTTR). This is a big win for the developer experience while
maintaining SLA commitments.
• Isolating specific requests – Since traces document the entire lifecycle of a request, this information can be used to
isolate specific actions such as user behavior or business logic to investigate.
Despite these benefits, adoption numbers for distributed tracing pale in comparison to logging and metrics as distributed
tracing comes with its fair share of challenges. First off, distributed tracing is only useful for the components that it touches.
Some tracing tools or frameworks don't support automatic injection for certain languages or components (especially on the front end).
This would result in missing data and added work to piece together the details. Also, depending on the application, tracing can
generate a significant amount of data. Dealing with the scale and surfacing the important signals can be a challenge.
• Automatic instrumentation – Most modern tracing tools support automatic injection of tracing capabilities without
significant modifications to the underlying codebase. Some languages or frameworks may not be fully supported in some cases, but where possible, opt for automated tooling instead of spending valuable developer time on manual instrumentation.
• Scalable data capture – To deal with massive amounts of tracing data, some tools opt to downsample, which may result
in missing or unrepresentative data. Choose tools that can handle the volume and intelligently surface important signals.
• Integrations – Traces are one part of the observability stack. Traces will be more useful if they can be easily tied to existing
logs or metrics for a comprehensive overview. The goal should be to leverage the power of tracing, alongside other
signals, to get to actionable insights and proactive solutions rather than collect data for retroactive analysis only.
Popular Tools
The original infrastructure for supporting internet-scale distributed tracing can be attributed to Dapper, Google's internal tool
announced in 2010. Since then, there's been a proliferation of both open-source and enterprise-grade SaaS tools in the market.
OPEN-SOURCE TOOLS
The open-source ecosystem for distributed tracing is fairly mature with a lot of the projects backed by large tech companies.
Each tool listed below supports most programming languages and flexible deployment options:
• OpenTelemetry – an industry-leading observability framework developed by the CNCF that aims to standardize how to
instrument and export telemetry data, including traces
Conclusion
Distributed tracing, when implemented properly with logs and metrics, can provide tremendous value in surfacing how
requests move in a complex, microservices-based system. Traces uncover performance bottlenecks and errors as requests
bounce from one service to another, mapping out a global view of the application. As the number of services grows alongside
the complexity that follows with it, a good distributed tracing system will become a necessity for any organization looking to
upgrade their observability platform.
While implementing tracing requires some planning, with a growing number of robust open-source and commercial tools
available, organizations can now easily adopt tracing without a significant engineering overhaul. Invest in a good distributed
tracing infrastructure to reduce MTTD/MTTR and improve the developer experience at your organization.
Yitaek Hwang is a software engineer at NYDIG working with blockchain technology. He often writes about
cloud, DevOps/SRE, and crypto topics.
Building an Open-Source
Observability Toolchain
By Sudip Sengupta, Principal Architect & Technical Writer at Javelynn
Open-source software (OSS) has had a profound impact on modern application delivery. It has transformed how we think
about collaboration, lowered the cost to maintain IT stacks, and spurred the creation of some of the most popular software
applications and platforms used today.
The observability landscape is no different. In fact, one could argue that open-source observability tools have been even more
transformative within the world of monitoring and debugging distributed systems. By making powerful tools available to
everyone — and allowing anyone to contribute to their core construct — open-source observability tools allow organizations
of all sizes to benefit from their powerful capabilities in detecting error-prone patterns and offering insights into a framework's internal state.
In this article, we will discuss the benefits of building an open-source toolchain for the observability of distributed systems,
strategies to build an open-source observability framework, best practices in administering comprehensive observability, and
popular open-source tools.
Observability frameworks broadly fall into two categories:
• Centralized frameworks – These are typically designed for large enterprises that consume a lot of resources and need
to monitor numerous distributed systems at once. Because such frameworks are supported by a lot of hardware and software,
they are expensive to set up and maintain.
• Decentralized frameworks – These frameworks are preferred for use cases that do not immediately require as much
equipment or training and that call for a lower up-front investment in software licenses. Because decentralized
frameworks aid collaboration and allow enterprises to customize source code to meet specific needs, they are
a popular choice when building an entire tech stack from scratch.
Many organizations use OSS because it's free and easy to use, but there's more to it than that. Open-source tools also offer
several advantages over proprietary solutions for monitoring how your applications are performing. Beyond monitoring application
health, open-source observability tools enable developers to retrofit the system for ease of use, availability, and security. Using
OSS tools for observability offers numerous other benefits as well.
Some recommended strategies to build an open-source observability framework include proactive anomaly detection, time-
based event correlation, shift-left for security, and adopting the right tools.
Logs should also combine timestamps with sequential records of all cluster events. This is important because time-series data
helps correlate events by pinpointing when something occurred, as well as the specific events preceding the incident.
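For example, emitting each event as a structured record with a UTC timestamp makes it straightforward to merge logs from many nodes onto a single timeline and walk backward from an incident. The snippet below is a minimal sketch built on Python's standard logging module; the field names and component name are illustrative.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Emit each cluster event as one JSON record with a UTC timestamp for easy correlation."""

    def format(self, record):
        return json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "component": record.name,
            "event": record.getMessage(),
        })


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("cluster.scheduler")  # hypothetical component name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("pod evicted due to memory pressure on node-7")
```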
Beyond identifying the root cause of issues, the toolchain should enrich endpoint-level event data through continuous
collection and aggregation of performance metrics. This data offers actionable insights that make distributed systems self-
healing, eliminating the manual overhead of detecting and mitigating security and performance flaws.
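As one way to picture that continuous collection, the sketch below uses the Prometheus Python client to record request latency and error counts and expose them for scraping. The metric names, labels, and port are assumptions made for illustration only.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUEST_LATENCY = Histogram("http_request_duration_seconds", "Request latency", ["endpoint"])
REQUEST_ERRORS = Counter("http_request_errors_total", "Failed requests", ["endpoint"])


def handle_checkout() -> None:
    with REQUEST_LATENCY.labels(endpoint="/checkout").time():  # observe duration automatically
        time.sleep(random.uniform(0.01, 0.2))                  # stand-in for real work
        if random.random() < 0.05:
            REQUEST_ERRORS.labels(endpoint="/checkout").inc()


if __name__ == "__main__":
    start_http_server(9100)  # exposes /metrics for a Prometheus scrape job
    while True:
        handle_checkout()
```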
Along with considerations for the observability components, consider what it means to observe a system by factoring in
scalability. For instance, observing a multi-cloud, geographically distributed setup would require niche platforms when
compared to monitoring a monolithic, single-cluster workload.
Some recommended best practices to enable effective observability in distributed systems include:
| Tool | Purpose | Advantages | Limitations |
|------|---------|------------|-------------|
| Logstash | Log collection and aggregation | Native ELK stack integration for comprehensive observability; offers filter plugins for the correlation, measurement, and simulation of events in real time; supports multiple input types; persistent queues continue to collect data when nodes fail | Lacks content routing capabilities |
| Fluentd | Collection, processing, and exporting of logs | Supports both active-active and active-passive node configurations for availability and scalability; inbuilt tagging and dynamic routing capabilities; offers numerous plugins to support data ingestion from multiple sources; supports seamless export of processed logs to different third-party solutions; easy to install and use | Adds an intermediate layer between log sources and destinations, eventually slowing down the observability pipeline |
| Prometheus with Grafana | Monitoring and alerting | PromQL query language offers flexibility and scalability in fetching and analyzing metric data; combines high-end metric collection and feature-rich visualizations; deep integration with cloud-native projects enables holistic observability of distributed DevOps workflows | Lacks long-term metric data storage for historical and contextual analysis |
| OpenTelemetry | Observability instrumentation | Requires no performance overhead for generation and management of observability data; enables developers to switch to new back-end analysis tools by using relevant exporters; requires minimal changes to the codebase; efficient utilization of agents and libraries for auto-instrumentation of programming languages and frameworks | Does not provide a visualization layer |
Summary
Observability is a multi-faceted undertaking that requires distributed, cross-functional teams to own different responsibilities
before they can trust the information presented through key indicators. Despite the challenges, observability is essential for
understanding the behavior of distributed systems. With the right open-source tools and practices, organizations can build an
open-source observability framework that ensures systems are fault tolerant, secure, and compliant. Open-source tools help
design a comprehensive platform that is flexible and customizable to an organization’s business objectives while benefiting
from the collective knowledge of the community.
Sudip Sengupta is a TOGAF Certified Solutions Architect with more than 17 years of experience working for
global majors such as CSC, Hewlett Packard Enterprise, and DXC Technology. Sudip now works as a full-
time tech writer, focusing on Cloud, DevOps, SaaS, and cybersecurity. When not writing or reading, he’s
likely on the squash court or playing chess.
Site reliability engineering (SRE) is the state of the art for ensuring services are reliable and perform well. SRE practices power
some of the most successful websites in the world. In this article, I'll discuss who site reliability engineers (SREs) are, what they
do, key philosophies shared by successful SRE teams, and how to start migrating your operations teams to the SRE model.
In reality, most people are not equally skilled at operations work and software engineering work. Acknowledging that different
people have different interests within the job family is likely the best way to build a happy team. Offering a mix of roles and job
descriptions is a good idea to attract a diverse mix of SRE talent to your team.
Depending on the size and maturity of the company, the role of an SRE
varies, but at most companies, SREs are responsible for these elements:
architecture, deployment, operations, firefighting, and fixing.
ARCHITECT SERVICES
SREs understand how services actually operate in production, so they
are responsible for helping design and architect scalable and reliable
services. These decisions are generally sorted into design-related and
capacity-related decisions.
DESIGN CONSIDERATIONS
This aspect focuses on reviewing the design of new services and involves
answering questions like:
• Is a new service written in a way that works with our other services?
• Is it scalable?
• Can it run in multiple environments at the same time?
• How does it store data/state, and how is that synchronized across other environments/regions?
CAPACITY CONSIDERATIONS
In addition to the overall architecture, SREs are tasked with figuring out cost and capacity requirements. To determine these
requirements, questions like these are asked:
OPERATE SERVICES
Once the service has been designed, it must be deployed to production, and changes must be reviewed to ensure that those
changes meet architecture goals and service-level objectives.
DEPLOY SOFTWARE
This part of the job is less important in larger organizations that have adopted a mature CI/CD practice, but many organizations
are not yet there. SREs in these organizations are often responsible for the actual process of getting binaries into production,
performing a canary deployment or A/B test, routing traffic appropriately, warming up caches, etc. At organizations without CI/
CD, SREs will generally also write scripts or other automation to assist in this deployment process.
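As a hedged sketch of what that deployment automation might look like, the script below shifts traffic to a canary in stages and rolls back if the error rate exceeds a budget. The Prometheus endpoint, metric names, and the traffic-routing call are placeholders; a real pipeline would drive a load balancer or service mesh rather than printing.

```python
import time

import requests  # used here to query a Prometheus-style HTTP API (endpoint is illustrative)

PROMETHEUS = "http://prometheus:9090/api/v1/query"  # placeholder address
CANARY_STEPS = [5, 25, 50, 100]                     # percent of traffic per stage
ERROR_BUDGET = 0.01                                 # abort above a 1% error rate


def canary_error_rate(service: str) -> float:
    """Fetch the canary's error ratio over the last 5 minutes (metric names are assumptions)."""
    query = (f'sum(rate(http_errors_total{{service="{service}"}}[5m])) / '
             f'sum(rate(http_requests_total{{service="{service}"}}[5m]))')
    result = requests.get(PROMETHEUS, params={"query": query}).json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


def set_traffic_split(service: str, percent: int) -> None:
    """Placeholder for the real routing change (load balancer, service mesh, CDN, etc.)."""
    print(f"routing {percent}% of {service} traffic to the canary")


def canary_rollout(service: str) -> bool:
    for percent in CANARY_STEPS:
        set_traffic_split(service, percent)
        time.sleep(300)                             # let metrics accumulate at this stage
        if canary_error_rate(service) > ERROR_BUDGET:
            set_traffic_split(service, 0)           # roll back on a blown error budget
            return False
    return True
```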
REVIEW CODE
SREs are often involved in the code review process for performance-critical sections of production applications as well as for
writing code to help automate parts of their role to remove toil (more on toil below). This code must be reviewed by other
SREs before it is adopted across the team. Additionally, when troubleshooting an on-call issue, a good SRE can identify faulty
application code as part of the escalation flow or even fix it themselves.
FIREFIGHT
While not glamorous, firefighting is a signature part of the role of an SRE. SREs are an early escalation target when issues are
identified by an observability or monitoring system, and SREs are generally responsible for answering calls about service issues
24/7. Answering one of these calls is a combination of thrilling and terrifying: thrilling because your adrenaline kicks in
and you are "saving the day," and terrifying because every second the problem isn't fixed, your customers are unhappy. SREs
answering on-call pages must recognize that a problem exists, locate it in a very complicated system, and then fix it
either on their own or by engaging a software engineer.
For each on-call incident, SREs must identify that an issue exists using metrics, find the service causing the issue using traces,
then identify the cause of the issue using logs.
SRE Philosophies
One of the most common questions asked is how SREs differ from other operations roles. This is best illustrated through SRE
philosophies, the most prevalent of which are listed below. While any operations role will likely embrace at least some of these,
only SREs embrace them all.
– One philosophy around toil held by many SREs is to try to "automate yourself out of a job" (though there will always
be new services to work on, so you never quite get there).
– A good SRE will work to have the application's deployment fully automated so that infrastructure and code are
stored in the same repositories and deploy at the same time, meaning that if the entire existing infrastructure was
blown away, the application could be brought back up easily.
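One way to make "infrastructure and code in the same repository" concrete is an infrastructure-as-code program that lives next to the application and runs in the same pipeline. The fragment below is a sketch using Pulumi's Python SDK purely as an example; the bucket name and cloud provider are assumptions, and any IaC tool (Terraform, CloudFormation, CDK, and so on) serves the same purpose.

```python
# infra/__main__.py: deployed by the same pipeline that ships the application code,
# so a lost environment can be rebuilt from the repository alone.
import pulumi
import pulumi_aws as aws

# Hypothetical artifact bucket for the service's build outputs.
artifacts = aws.s3.Bucket("app-artifacts", force_destroy=True)

pulumi.export("artifact_bucket", artifacts.id)
```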
A good way to start this process is to consider migrating your legacy monitoring to observability. For most organizations, this
involves instrumenting their applications to emit metrics, traces, and logs to a centralized system that can use AI to identify root causes.
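As a starting point for that instrumentation, the fragment below configures an OTLP exporter that ships spans to a central OpenTelemetry Collector (assuming the opentelemetry-exporter-otlp package is installed). The collector address and service name are placeholders, and the same pattern applies to metrics and logs.

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Send spans to a central collector; the address below is illustrative.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout-service"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True)))
trace.set_tracer_provider(provider)
```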
Conclusion
SRE, traditionally, merges application developers with operations engineers to create a hybrid superhuman role that can do
anything. SREs are difficult to hire and retain, so it's important to embrace as much of the SRE philosophy as possible. By
starting small with one app or part of your infrastructure, you can ease the pain associated with changing how you develop
and deploy your application. The benefits gained by adopting these modern practices have real business value and will enable
you to be successful for years to come.
Greg Leffler heads the Observability Practitioner team at Splunk and is on a mission to spread the
good word of observability to the world. Greg's career has taken him from the NOC to SRE, from SRE
to management, with side stops in security and editorial functions. Greg has experience as a systems
administrator at eBay Ads, and as an SRE and SRE Senior Manager at LinkedIn.
Site reliability engineering aims to keep servers and services running with zero downtime. However, outages and incidents are
inevitable, especially when dealing with a complex system that constantly gets new updates. Every company has a relatively
similar process to manage incidents, mitigate risks, and analyze root causes. This can be considered an opportunity to identify
issues and prevent them from happening, but not every company is successful at making it a constructive process.
In this article, I will discuss the advantages of the blameless postmortem process and how it can drive a culture change in a
company: a culture of improvement, not blame.
• Keep to a small group. Only related people from various roles and responsibilities are invited to this meeting. The group
stays small to ensure that the meeting will be short and productive.
• Start with facts. One important thing about this meeting is that there is no time for guessing. Instead, facts are shared
with the team to help people understand the issue and perhaps identify the root cause.
• Listen to stories. After highlighting the facts, there might be some extra discussion from team members who were
either involved in the incident process or might have some knowledge about that particular issue.
• Find out the reasons. Most of the time, the root cause is found before this meeting, but in cases where it is
still unknown, there will be a discussion to plan further investigation, perhaps involving a third party to help. Because
the root cause has not yet been found, the incident might occur again, so extra measures will be taken to prepare for
possible incidents.
• Create action points. Depending on the outcome of the discussion, the actions will vary. If the root cause is known,
actions will be taken to avoid this incident. Otherwise, further investigations will be planned and assigned to a team to
find the root cause.
Sometimes a postmortem meeting turns into another retro in which team members start arguing with each other or discussing
issues that are outside the scope of the incident, resulting in people pointing fingers at each other rather than discussing the root cause.
This damages team morale, and such unproductive behavior leads to more failures in the future.
That's why it is essential to have a blameless postmortem meeting to ensure people feel comfortable sharing their opinions
and to focus on improving the process. Now the question is, what does a blameless postmortem look like? Here is my recipe to
arrange a productive blameless postmortem process.
DEFINE SOLUTIONS
If the root cause is known, you can plan with the team to implement a solution to prevent this incident from happening again.
If it is not known, it would be best to spend more time on the investigation to find the root cause and take extra measures or
workarounds to prepare for possible similar incidents.
• No postmortem is left unreviewed. Arranging regular review sessions helps look into outstanding postmortems,
close the discussions, collect ideas, and draw up actions. As a result, all postmortems are taken seriously and processed.
• Introduce a postmortem culture. Using a collaborative approach with teams helps introduce a postmortem culture to an
organization more easily and quickly by providing various programs, including:
– Postmortem of the month: This event motivates teams to conduct a better postmortem process. So every month, the
best and most well-written postmortem will be shared with the rest of the organization.
– Postmortem reading clubs: Regular sessions are conducted to review past postmortems. Engineers can see what
other teams faced in previous postmortems and learn from the lessons.
• Ask for feedback on postmortem effectiveness. From time to time, survey teams to share their
experiences and feedback about the postmortem process. This helps evaluate the postmortem culture and
increase its effectiveness.
If you are interested in learning more about Google's postmortem culture, check out Chapter 15 of Google's book, Site
Reliability Engineering.
Conclusion
Site reliability engineers play an essential role in ensuring that systems are reliable, and keeping this reliability is a continuous
job. While developers are thinking of new features, SREs are thinking of a better and smoother process to release features.
Incidents are part of the software development lifecycle, but modern SRE teams define processes that help turn those
incidents into opportunities to improve their systems. SREs understand the importance of blameless postmortem meetings, where
failures are accepted as part of development, and that is why they can stay focused on reliability.
The future of incident management will involve more automation and perhaps artificial intelligence, where a system can fix
most issues itself. For now, SREs are using blameless postmortems to improve uptime, productivity, and the quality of
team relationships.
Alireza is a software engineer with more than 22 years of experience in software development. He started
his career as a software developer, and in recent years, he transitioned into DevOps practices. Currently,
he is helping companies and organizations move away from traditional development workflows and
embrace a DevOps culture. Additionally, Alireza coaches organizations, as an Azure Specialist, on their migration journey to
the public cloud.
ITOps, DevOps, AIOps – All Things Ops
Host Elias Voelker interviews senior IT executives and thought leaders in this podcast that covers, as its name suggests, all things ops. From timely topics such as how ITOps and AIOps can recession-proof the cost of IT to discussions around uptime in the context of site reliability engineering, this podcast will help with your day-to-day IT infrastructure operation and management.

OpenObservability Talks
As its name suggests, this podcast serves to amplify the conversation on open-source technologies and advance observability efforts for DevOps. Listen to industry leaders and contributors to projects like OpenTelemetry and Jaeger discuss their use cases, best practices, and vision for the space.

TestGuild Performance Testing and Site Reliability Podcast
Since 2019, TestGuild has brought us 100-and-counting episodes that cover a wide range of performance-related topics. Tune in with Joe Colantonio to learn about chaos engineering, test automation, API load testing, monitoring, site reliability, and (much) more.

k6
Grafana Labs' YouTube channel for k6 is the perfect rabbit hole to lose yourself in. Filled with educational videos covering observability, performance testing, open-source tools, and more, you can build your knowledge base for application performance and reliability.

DevOps Pulse 2022: Challenges to the Growing Advancement of Observability
DZone's webinar with Logz.io takes a deep dive into the data and key takeaways from the DevOps Pulse Report. The discussion spans topics such as the increased observability tool sprawl that is driving up complexity, how distributed tracing remains nascent, and how rising costs and data volumes are hindering observability strategies.

What Do IT and Engineering Leaders Need to Know About Observability
In this DZone webinar, sit down with Observe Founder and VP of Engineering Jacob Leverich and RedMonk co-founder James Governor as they discuss observability, how it increases organizations' troubleshooting competency, and the impact of containerization and serverless platforms on observability.

Getting Started With Prometheus
Prometheus has become the de facto standard for the monitoring and alerting of distributed systems and architecture. In this Refcard, we explore the core components of the Prometheus architecture and key concepts — then focus on getting up and running with Prometheus, including configuration and both collecting and working with data.

Observability Maturity Model: Essentials for Greater IT Reliability
Modern systems and applications are increasingly more dynamic, distributed, and modular in nature. To support their systems' availability and performance, ITOps and SRE teams need advanced monitoring capabilities. This Refcard reviews the distinct levels of observability maturity, key functionality at each stage, and next steps organizations should take to enhance their monitoring practices.

Continuous Delivery Patterns and Anti-Patterns
This Refcard explains detailed patterns and anti-patterns for core areas of continuous delivery, including the delivery and deployment phases, rollbacks, pipeline observability and monitoring, documentation, as well as communication across teams and within the organization.

TREND REPORTS

Application Performance Management
DZone's 2021 APM Trend Report dives deeper into the management of application performance in distributed systems, including observability, intelligent monitoring, and rapid, automated remediation. It also provides an overview of how to choose an APM tool provider, common practices for self-healing, and how to manage pain points that distributed cloud-based architectures cause.

DevOps: CI/CD and Application Release Orchestration
In DZone's 2022 DevOps Trend Report, we provide insight into how CI/CD has revolutionized automated testing, offer advice on why an SRE is important to CI/CD, explore the differences between managed and self-hosted CI/CD, and more. Our goal is to offer guidance to our global audience of DevOps engineers, automation architects, and all those in between on how to best adopt DevOps practices to help scale the productivity of their teams.
Solutions Directory
This directory contains performance and site reliability tools to assist with management, monitoring,
observability, testing, and tracing. It provides pricing data and product category information gathered
from vendor websites and project pages. Solutions are selected for inclusion based on several impartial
criteria, including solution maturity, technical innovativeness, relevance, and data availability.
2022 PARTNERS

| Company | Product | Purpose | Availability | Website |
|---------|---------|---------|--------------|---------|
| Chronosphere | Chronosphere Observability Platform | Cloud-native observability | By request | chronosphere.io/platform |
| Scout APM | TelemetryHub | Full-stack observability | Trial period | telemetryhub.com/products/telemetryhub |
| StackState | StackState Observability Platform | Observability | Trial period | stackstate.com/platform/overview |

| Company | Product | Purpose | Availability | Website |
|---------|---------|---------|--------------|---------|
| Amazon Web Services | AWS X-Ray | Distributed tracing | Free tier | aws.amazon.com/xray |
| Fortra | VCM | Performance optimization | Trial period | fortra.com/products/it-performance-optimization-software |
| Fortra | Performance Navigator | Performance monitoring | Free tier | fortra.com/products/capacity-planning-and-performance-analysis-software |
| FusionReactor | FusionReactor APM | Application performance management | Trial period | fusion-reactor.com |
| IBM | IBM Turbonomic | Application resource management | Sandbox | ibm.com/products/turbonomic |
| INETCO | INETCO Insight | Performance monitoring | By request | inetco.com/products-and-services/inetco-insight-for-payment-transaction-monitoring |
| ITRS | Trade Analytics | Trade infrastructure monitoring | | itrsgroup.com/products/trade-analytics |
| JenniferSoft | JENNIFER | Application performance management | Trial period | jennifersoft.com |
| Micro Focus | Operations Bridge | AIOps | By request | microfocus.com/en-us/products/operations-bridge |
| Micro Focus | Network Node Manager i | Network performance monitoring | By request | microfocus.com/en-us/products/network-node-manager-i-network-management-software |
| Micro Focus | Network Operations Management | Network management | By request | microfocus.com/en-us/products/network-operations-management-suite |
| Microsoft Azure | Azure Monitor | Observability | Free tier | azure.microsoft.com/en-us/products/monitor |
| Microsoft Azure | Network Watcher | Network performance monitoring | Free tier | azure.microsoft.com/en-us/products/network-watcher |
| Power Admin | Server Monitor | Network monitoring | By request | poweradmin.com/products/server-monitoring |
| Riverbed | Alluvio NetProfiler | Network traffic monitoring | Trial period | riverbed.com/products/npm/netprofiler.html |
| SolarWinds | AppOptics | Full-stack performance monitoring | | solarwinds.com/appoptics |
| SolarWinds | SolarWinds Observability | Observability | | solarwinds.com/solarwinds-observability |
| SpeedCurve | SpeedCurve | Website performance monitoring | Trial period | speedcurve.com |
| Splunk | Splunk APM | Observability | Trial period | splunk.com/en_us/products/apm-application-performance-monitoring.html |
| Splunk | Splunk IT Service Intelligence | AIOps | By request | splunk.com/en_us/products/it-service-intelligence.html |
| Site24x7 | Site24x7 | End-to-end performance monitoring | Trial period | site24x7.com |