AIOps Done Right
A Dynatrace eBook
©2023 Dynatrace
What's inside

INTRODUCTION: The promise of AI
CHAPTER 1
CHAPTER 6: Auto-remediation

1. AI Technologies — William Blair Industry Report, June 28, 2018
How to avoid closing down 500 supermarkets on a busy Saturday

Coop, Denmark's largest food retail group, celebrated its 150th anniversary by digitizing its business and moving 80% of its core apps into the cloud.

In 2016, Coop launched its new customer loyalty solution and an updated point-of-sale software. Despite extensive testing, a problem developed soon after the launch in production—checkout registers froze when trying to print out receipts. Suddenly, Coop was facing the prospect of having to close 500 of its stores on a busy Saturday morning because its payment systems were down.

However, two minutes after the first problems occurred in a couple of stores, Dynatrace monitoring software was able to pinpoint the root cause: a lack of CPU power in the Azure cloud. A major breakdown was avoided by simply spinning up additional resources on the fly.

“…less effort troubleshooting and more time innovating and achieving transformative business outcomes.”
—Jeppe Hedesgaard Lindberg, Application Performance Manager, Coop Denmark

“We fire up Dynatrace, and immediately the AI goes to work and identifies problems. There’s no digging—it’s bubbling to the top. It’s right there in your face. It just does it for you; it’s amazing.”
—Steve Strout, Director, Platform Engineering, Assurant

“The AI paves the way for autonomous operations, enabling us to create auto-remediation workflows that remove the need for human intervention in the resolution of recurring problems.”
—David Shepherd, Service Delivery Manager, Global IT Service Excellence, Experian
…mean time to detect (MTTD). However, further reduction of MTTR requires automatic root-cause analysis. This turns out to be a tricky task that requires advanced statistics like machine learning. However, even the best baselining methods prove to be inadequate when it comes to the cloud.

[Figure: Number of incident-response calls without vs. with Dynatrace: 90% fewer calls, with faster resolve and restore times and a 40% faster SLA]

With modern microservices architectures, a single failure can impact a multitude of connected services, which subsequently also fail. Therefore, a single problem can trigger many alerts. This is called an alert storm or noisy alerts.

Conventional monitoring solutions fall short of resolving this issue. It remains up to human operators to make sense of the alerts. Problem triage becomes a time-consuming and often frustrating exercise involving war rooms and long hours. The only way out is a reliable method for determining the underlying root cause automatically.
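The alert-storm problem can be sketched in code: if alerts on services that are connected in the dependency graph are assumed to share one root cause, a flood of alerts collapses into a handful of problems. A minimal sketch, where the graph model and alert format are hypothetical, not Dynatrace's:

```python
from collections import defaultdict

def group_alerts(alerts, depends_on):
    """Collapse an alert storm into problems: alerts on services that are
    connected in the dependency graph are assumed to share one root cause."""
    # Build an undirected adjacency view of the dependency graph.
    adj = defaultdict(set)
    for svc, deps in depends_on.items():
        for dep in deps:
            adj[svc].add(dep)
            adj[dep].add(svc)
    alerting = {a["service"] for a in alerts}
    seen, problems = set(), []
    for svc in sorted(alerting):
        if svc in seen:
            continue
        # Flood-fill, but only across services that are currently alerting.
        group, stack = set(), [svc]
        while stack:
            cur = stack.pop()
            if cur in group:
                continue
            group.add(cur)
            stack.extend(n for n in adj[cur] if n in alerting and n not in group)
        seen |= group
        problems.append(sorted(group))
    return problems
```

With alerts on a frontend, its checkout dependency, and the payments service behind it, plus an unrelated search alert, the four alerts collapse into two problems instead of four tickets.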
There are two very distinct AI-based approaches to reduce alert noise:

Deterministic AI using fault-tree analysis
• Rooted in safety engineering
• Works in near real time
• Easily visualize results to pinpoint root causes and help you understand impact

Statistical correlation-based AI
• Builds a multidimensional model
• Building the model takes time and lags in dynamic environments
• Requires human interpretation to determine root causes

A large health insurer with 350 hosts captures 900,000 events per minute.
In an environment of disparate monitoring tools, operations personnel are left to make sense of multiple diverse inputs coming from various sources. This increases the likelihood of error in situational awareness and diagnosis. Currently, only 5% of applications are monitored.² The aim is to get full end-to-end visibility.

Full system visibility is a necessary precondition for automating operations, including solid self-remediation. We need full insight not only into the application—including containers and functions-as-a-service—but also into all layers of the cloud infrastructure, networks, the CI/CD pipeline, and the real user experience. In many cases, data collection itself comes for free, as all major public cloud providers offer monitoring APIs, and open-source tools are abundantly available. However, the following considerations are critical:

• How much manual effort is required for instrumentation and deployment of updates?

A big airline with 2,500 hosts has 432 million topology updates per day.
A topology map captures and visualizes the entire application environment. This includes the vertical stack (infrastructure, services, and processes) and the horizontal dependencies (i.e., all incoming and outgoing call relationships). The best monitoring solution provides auto-discovery of new environment components and near real-time updates.

A service flow map offers a transactional view that illustrates the sequence of service calls from the perspective of a single service or request. A service flow map displays a step-by-step sequence of a whole transaction, whereas a topology is a higher abstraction that shows only general dependencies. Service flows require high-fidelity data with minimal or no sampling.
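The distinction between the two views can be sketched with a toy graph model (the service names and span format are assumptions for illustration): the topology is a static adjacency structure, while a service flow is the ordered call sequence recovered from one request's unsampled trace.

```python
# Topology map: static dependencies (who can call whom).
topology = {
    "web-app":   ["service-1"],
    "service-1": ["service-2", "service-3"],
    "service-2": [],
    "service-3": [],
}

def service_flow(trace):
    """A service flow is the ordered call sequence of one concrete request,
    reconstructed from high-fidelity (unsampled) trace spans."""
    # Sort spans by start time to recover the step-by-step sequence.
    return [span["service"] for span in sorted(trace, key=lambda s: s["start"])]

def reachable(topo, root):
    """The topology view: every service the root transitively depends on."""
    out, queue = [], [root]
    while queue:
        svc = queue.pop(0)
        if svc in out:
            continue
        out.append(svc)
        queue.extend(topo.get(svc, []))
    return out
```

Note the asymmetry: `reachable` answers "what could this request touch?", while `service_flow` answers "what did this request actually touch, and in what order?", which is why flows need minimal sampling.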
Enterprises without AI attempt the impossible. As enterprises embrace a hybrid, multicloud environment, the sheer volume of data and massive environmental complexity will make it impossible for humans to monitor, comprehend, and take action.

52 billion dependencies analyzed per day to find problem root causes for a multinational business systems company with 17,500 hosts.

Challenge

We are quickly entering a time when humans will no longer be the main actors to fix IT problems or push code into production. Cloud and AI solutions revolve around automation, so DevOps won’t require nearly as much human intervention in the future. For AIOps (truly autonomous cloud operations) to work perfectly, we need a system that can not only identify that something is wrong but also pinpoint the true root cause.
Modern, highly dynamic microservices architectures run in hybrid and multicloud environments. Infrastructure
and services are spun up and spun down within the blink of an eye as loads demand. Determining the root cause
of an anomaly requires exponentially more effort than humans can take on.
1. A web app exhibits an anomaly, like a degraded response time (see top left in the graphic).
2. Davis first “takes a look” at the vertical stack below and finds that everything performs as expected—no problems there.
3. From here, Davis follows all the transactions and detects a dependency on Service 1 that also shows an anomaly. In addition, all further dependencies (Services 2 and 3) exhibit anomalies as well.
5. In this case, the root cause is a CPU saturation in one of the Linux hosts.

[Figure: Davis traverses the dependency graph: the web app runs on a webserver cluster of hosts and calls Service 1, which in turn calls Services 2 and 3, each running on its own host]
Insight

Infrastructure and services get spun up and spun down as needed at a mind-boggling speed in a modern dynamic microservice application. That’s the nature of a healthy system. A disappearing container can be a desired event to optimize resources, or it can be a sign of an unintended disruption that requires immediate mitigation. The AI needs to be able to tell an anomaly from a desired change.

Challenge

A precise and reliable determination of the technical root cause is absolutely essential for auto-remediation, but it is not sufficient. We also need a measure of an anomaly’s severity and some indication of what led to the technical root cause in the first place. Not every disappearing container or host is a problem, and a slow service that nobody uses does not require immediate attention. Therefore, an advanced software intelligence system assesses the severity of a problem:

User impact
How many users have been impacted by a detected problem since it occurred? Ideally, the number should be based on actual real users rather than a statistical extrapolation of historic data.

Service calls impacted
Some parts of the system are not built for human interaction. In this case, the number of affected service calls is a good estimate of the severity.

Business impact
As software intelligence solutions increasingly cover enterprise systems end-to-end, from user actions all the way to the infrastructure, it is possible to map system performance to business key performance indicators (KPIs). A retailer, for example, can measure the dollar value of purchases during a system slowdown and compare it with a reference time frame in the past.
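The three impact dimensions can be folded into a single severity rating. The weights and thresholds below are purely illustrative assumptions, not Dynatrace’s actual scoring logic:

```python
def assess_severity(impacted_users, affected_calls, revenue_delta_pct):
    """Rank a detected problem by user, service-call, and business impact.
    Thresholds and weights are illustrative only."""
    score = 0
    if impacted_users > 0:
        score += 2 if impacted_users >= 100 else 1
    if affected_calls >= 1000:      # machine-to-machine traffic matters too
        score += 1
    if revenue_delta_pct <= -5.0:   # revenue down vs. a reference window
        score += 2
    return {0: "info", 1: "low", 2: "medium", 3: "high"}.get(score, "critical")
```

A slow service nobody uses scores "info" and can wait; a problem hitting hundreds of real users, thousands of service calls, and the revenue KPI escalates straight to "critical".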
Dynatrace provides an API and plug-ins to ingest third-party data into Davis.

[Figure: Business-impact dashboard comparing per-service conversion counts vs. yesterday and vs. last week (e.g., Checkout: 51,782; Order Details: 379), alongside the detected root cause: CPU saturation at 100% CPU usage, with an option to analyze logs]
Auto-remediation
Insight
Infrastructure as code and powerful cloud orchestration layers provide the necessary ingredients to automate operations and enable self-
remediation. This will not only reduce operational cost but also avoid human error. The key to truly autonomous cloud operations is reliable
system health information including deep anomaly root-cause and impact analyses.
Challenge
Enabling auto-remediation
There are many ways of implementing auto-remediation in practice. Typically, the software intelligence platform integrates with the CI/CD pipeline, ITSM solutions, and other connected tools.

[Figure: Auto-remediation workflow: a full-stack environment is monitored → anomalies are detected automatically → root-cause analysis is performed → a problem notification is sent → the event is received → a job is triggered → a playbook is executed → the problem is remediated. Example problem card: 325 impacted users, affected service calls, Auto Mitigate action, problem evolution from 08:00 to 08:45]
This example shows how a precise analysis of the technical root cause, foundational root causes, and user/business impact can be used to automate problem resolution through integration with a variety of CI/CD, ITOM, workflow, and cloud technologies.

[Figure: Complex auto-remediation sequence]
• High garbage collection? Adjust/revert memory settings.
• Issue with BLUE only? Switch back to GREEN.
• Hung threads? Restart the service.
• Impact mitigated? Update dev tickets and mark bad commits.
• Still ongoing? Initiate a rollback.
• Still ongoing? Escalate.
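The decision sequence in the figure can be sketched as an ordered playbook: run an action, re-check whether the impact is mitigated, and escalate only if the problem is still ongoing. The action names and the re-check callback are hypothetical:

```python
def run_playbook(root_cause, still_ongoing, actions):
    """Execute remediation steps for a diagnosed root cause in order,
    stopping at the first step after which the problem is resolved.
    `actions` is a list of (condition, action) pairs; a condition of
    "always" applies regardless of the root cause."""
    executed = []
    for condition, action in actions:
        if condition != root_cause and condition != "always":
            continue
        executed.append(action)
        if not still_ongoing():       # impact mitigated? then stop here
            return executed, "remediated"
    executed.append("escalate")       # every automated step failed
    return executed, "escalated"
```

The escalation at the end is the key design choice: automation handles the recurring cases, and a human is paged only when the whole sequence fails, which is exactly the division of labor the chapter describes.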
Automation doesn’t stop at software operations and auto-remediation in an enterprise-grade application environment. Accurate and explainable
software intelligence has the capacity to move toward automating the entire digital value chain and to enable novel business processes.
The unbreakable DevOps pipeline

Over the last years, many DevOps teams have come a long way in implementing a CI/CD pipeline that codifies and automates parts of the build, testing, and deployment steps. The goal is to speed up time to market and ensure excellent software quality—to get faster and better. AI-powered software intelligence helps to close existing automation gaps like manual approval steps at decision gates or build validation. It also provides valuable performance signatures to test new builds against production scenarios. This follows the concept of “shift left”: to use more production data earlier in the development lifecycle to answer the question, “Is this a good or bad change before we push it into production?”

“3x faster build and test cycles, 50% reduction in issues.”
—Verizon Enterprise

[Figure: Continuous Integration (CI) and Continuous Delivery (CD) pipeline: check in → auto trigger → AI-powered quality gate]
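An AI-powered quality gate can be sketched as a comparison of a candidate build’s performance signature against a production baseline; the metric names and the 10% tolerance are assumptions for illustration:

```python
def quality_gate(baseline, candidate, tolerance=0.10):
    """Pass the build only if no metric regresses by more than `tolerance`
    against the production baseline. Lower is better for every metric here."""
    violations = []
    for metric, base_value in baseline.items():
        value = candidate.get(metric)
        if value is None:
            violations.append(f"{metric}: missing from candidate signature")
        elif value > base_value * (1 + tolerance):
            violations.append(f"{metric}: {value} vs baseline {base_value}")
    return (len(violations) == 0, violations)

# Performance signature from production vs. two candidate builds.
baseline = {"p95_response_ms": 240, "error_rate_pct": 0.5}
good = {"p95_response_ms": 250, "error_rate_pct": 0.4}
bad = {"p95_response_ms": 410, "error_rate_pct": 0.4}
```

Because the gate returns its violations rather than just a boolean, the pipeline can write them into the build report and replace a manual approval step with an explainable automated decision.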
Learn more If you are ready to learn more, please visit www.dynatrace.com/platform for assets, resources, and a free 15-day trial.
Dynatrace (NYSE: DT) exists to make the world’s software work perfectly. Our unified platform combines broad and deep observability and continuous runtime application security with the most advanced AIOps to provide
answers and intelligent automation from data at enormous scale. This enables innovators to modernize and automate cloud operations, deliver software faster and more securely, and ensure flawless digital experiences.
That’s why the world’s largest organizations trust the Dynatrace® platform to accelerate digital transformation.
Curious to see how you can simplify your cloud and maximize the impact of your digital teams? Let us show you. Sign up for a free 15-day Dynatrace trial.