
AIOps done right

Automation and precise answers powered by causal and predictive AI

©2023 Dynatrace
What's inside

INTRODUCTION
The promise of AI

CHAPTER 1
Anomaly detection and alerting

CHAPTER 2
Getting the best monitoring data

CHAPTER 3
AI operations and root-cause analysis

CHAPTER 4
Impact analysis and foundational root causes

CHAPTER 5
Auto-remediation

CHAPTER 6
Automation and system integrations


INTRODUCTION

The promise of AI

AI is driving the next innovation cycle in enterprise software¹, enabling new levels of intelligent automation and vertical integration. As today’s enterprise systems increase in size, the benefits of digitization and cloud computing go hand in hand with technological complexity and operational risks. AI-powered software intelligence holds the promise to tackle these challenges and enable a new generation of autonomous cloud enterprise systems.

¹ AI Technologies — William Blair Industry Report, June 28, 2018


Beyond error detection, towards self-healing

Consider this all-too-familiar challenge: An anomaly in a large microservices-based application triggers a storm of alerts as services around the globe are affected. As your application contains literally millions of dependencies, how do you find the original error?

Conventional monitoring tools are not much help. They collect metrics and raise alerts, but they provide few answers as to what went wrong in the first place.

In contrast, envision an intelligent system that accurately provides the answers—in this case, the technical root cause of the anomaly and how to fix it. Such intelligence, if accurate and reliable, can be trusted to trigger auto-remediation procedures before most users even notice a glitch.

AI and automation are poised to radically change the game in operations. And even more, it’s about collecting and applying intelligence along the entire digital value chain, from software development through service delivery all the way to customer interactions. Smart integration and automation will drive the next innovation cycle in enterprise software.

The promise of AI

Enable autonomous operations, boost innovation, and offer new modes of customer engagement by automating everything.

Root-cause analysis
Replace a storm of noisy anomaly alerts with accurate and reliable root-cause analysis.

Intelligent DevOps
Increase the speed of innovation and software quality through intelligent performance and regression testing.

Auto-remediation
Automate anomaly remediation and performance optimization based on system health and real user demands.

Smart customer engagement
Use business intelligence data to improve customer experience, including automatic remediation of breakdowns and complaints.


A proven record in AIOps

Dynatrace helps the world’s top global organizations simplify cloud complexity and accelerate digital transformation. At the core of the Dynatrace platform is Davis, our causal AI engine. Unlike correlation-based machine learning approaches, Davis is designed to handle the complexity of modern cloud environments. Davis processes trillions of dependencies in real time, continuously monitors the full stack for system degradation and performance anomalies, and delivers precise answers prioritized by business impact and with root-cause determination. This enables development, IT, security, and business teams to spend less effort troubleshooting and more time innovating and achieving transformative business outcomes.

How to avoid closing down 500 supermarkets on a busy Saturday

Coop, Denmark’s largest food retail group, celebrated its 150th anniversary by digitizing its business and moving 80% of its core apps into the cloud.

In 2016, Coop launched its new customer loyalty solution and an updated point-of-sale software. Despite extensive testing, a problem developed soon after the launch in production—checkout registers froze when trying to print out receipts. Suddenly, Coop was facing the prospect of having to close 500 of its stores on a busy Saturday morning because its payment systems were down.

However, two minutes after the first problems occurred in a couple of stores, Dynatrace monitoring software was able to pinpoint the root cause: a lack of CPU power in the Azure cloud. A major breakdown was avoided by simply spinning up additional resources on the fly.

“Dynatrace is the first I’ve seen where the AI really shines. Incredible.”
—Ariel Molina, Sr. Dir., Software Engineering & Enterprise Architecture, Carnival Cruise Line

“Dynatrace, within two minutes came back and said ‘you have a problem in your cloud instance’, and we spun up extra resources. So we avoided having to close down supermarkets and disappoint customers waiting in line.”
—Jeppe Hedesgaard Lindberg, Application Performance Manager, Coop Denmark

“We fire up Dynatrace, and immediately the AI goes to work and identifies problems. There’s no digging—it’s bubbling to the top. It’s right there in your face. It just does it for you; it’s amazing.”
—Steve Strout, Director, Platform Engineering, Assurant

“The AI paves the way for autonomous operations, enabling us to create auto-remediation workflows that remove the need for human intervention in the resolution of recurring problems.”
—David Shepherd, Service Delivery Manager, Global IT Service Excellence, Experian


CHAPTER 1

Anomaly detection and alerting

Insight

The concept of automating operations revolves around better troubleshooting, with the ultimate goal to reduce the mean time to recovery (MTTR). This is accomplished through automatic anomaly detection and alerting, i.e., a speedy mean time to discovery (MTTD). However, further reduction of MTTR requires automatic root-cause analysis.

[Figure: Incident timeline before and after Dynatrace. Before: 20 min engage, 20 min triage, 45 min find and assemble, 30 min resolve, and 35 min restore, with manual communication driving a high number of calls against the SLA. After, with Dynatrace and an incident response service: the discovery phases shrink to 5 min each and resolve to 10 min, with roughly 90% fewer calls and a 40% faster return to SLA.]

Challenge

Traditional monitoring tools focus on application performance metrics and baselining methods to distinguish normal from faulty behavior. Defining the anomaly thresholds turns out to be a tricky task that requires advanced statistics like machine learning. However, even the best baselining methods prove to be inadequate when it comes to the cloud.

With modern microservices architectures, a single failure can impact a multitude of connected services, which subsequently also fail. Therefore, a single problem can trigger many alerts. This is called an alert storm or noisy alerts.

Conventional monitoring solutions fall short of resolving this issue. It remains up to human operators to make sense of the alerts. Problem triage becomes a time-consuming and often frustrating exercise involving war rooms and long hours.

The only way out is a reliable method for determining the underlying root cause automatically.
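To make the baselining idea above concrete, here is a minimal sketch of an adaptive threshold: a rolling mean and standard deviation that flags outliers. This is an illustrative toy, not the algorithm any particular product uses, and the window and sigma values are arbitrary assumptions.

```python
from collections import deque
from statistics import mean, stdev

class RollingBaseline:
    """Toy adaptive baseline: flags a sample as anomalous when it
    deviates from the rolling mean by more than k standard deviations."""

    def __init__(self, window=60, k=3.0):
        self.samples = deque(maxlen=window)
        self.k = k

    def observe(self, value):
        """Return True if `value` is anomalous against the current baseline."""
        anomalous = False
        if len(self.samples) >= 10:  # need some history before judging
            mu, sigma = mean(self.samples), stdev(self.samples)
            anomalous = abs(value - mu) > self.k * max(sigma, 1e-9)
        self.samples.append(value)
        return anomalous

baseline = RollingBaseline(window=30, k=3.0)
normal = [100 + (i % 5) for i in range(30)]   # steady response times in ms
flags = [baseline.observe(v) for v in normal]
spike = baseline.observe(900)                  # a sudden 9x slowdown
print(any(flags), spike)
```

Even this toy shows the limitation described above: the baseline adapts to one metric in isolation, so a single upstream failure still fires one such alert per affected service, producing exactly the alert storm the chapter describes.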


Fine-tuning individual baselines helps, but it does not fix alert storms. For a real cure, we need to step outside the box and try to find the underlying root cause directly.

There are two very distinct AI-based approaches to reduce alert noise:

Deterministic AI using fault-tree analysis
• Performs step-by-step fault-tree analysis, commonly used in safety engineering
• Works in near real time
• Easily visualizes results to pinpoint root causes and help you understand impact

Statistical correlation-based AI
• Machine-learning AI uses a statistical approach that correlates data in a multidimensional model
• Building the model takes time and lags in dynamic environments
• Requires human interpretation to determine root causes

A large health insurer with 350 hosts captures 900,000 events per minute and 200,000 measures per minute.
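One way to picture why topology beats per-metric alerting is to group an alert storm by connectivity in the call graph: alerts on connected components collapse into a single problem. The sketch below is a hedged illustration of that idea with made-up component names, not a description of how any vendor's engine works.

```python
from collections import defaultdict

# Hypothetical dependency edges: caller -> callees
calls = {
    "web-app": ["service-1"],
    "service-1": ["service-2", "service-3"],
    "billing": ["ledger"],
}

def group_alerts(alerts, calls):
    """Group alerting components that are connected in the call graph,
    so one topological problem yields one alert group instead of many."""
    # Build an undirected adjacency map restricted to alerting components.
    adj = defaultdict(set)
    for src, dsts in calls.items():
        for dst in dsts:
            if src in alerts and dst in alerts:
                adj[src].add(dst)
                adj[dst].add(src)
    groups, seen = [], set()
    for comp in sorted(alerts):
        if comp in seen:
            continue
        stack, group = [comp], set()   # depth-first flood fill
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(adj[node] - group)
        seen |= group
        groups.append(group)
    return groups

storm = {"web-app", "service-1", "service-2", "service-3", "billing"}
print(group_alerts(storm, calls))  # five raw alerts, two actual problems
```

Five simultaneous alerts become two problem groups, which is the first step toward asking "what is the shared root cause of this group?" rather than triaging each alert by hand.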


CHAPTER 2

Getting the best monitoring data

Insight

In an environment of disparate monitoring tools, operations personnel are left to make sense of multiple diverse inputs coming from various sources. This increases the likelihood of error in situational awareness and diagnosis. Currently, only 5% of applications are monitored.² The aim is to get full end-to-end visibility.

Challenge

Full system visibility is a necessary precondition for automating operations, including solid self-remediation. We need full insight not only into the application—including containers and functions-as-a-service—but also into all layers of the cloud infrastructure, networks, the CI/CD pipeline, and the real user experience. In many cases, data collection itself comes for free, as all major public cloud providers offer monitoring APIs, and open-source tools are abundantly available. However, the following considerations are critical:

• How much manual effort is required for instrumentation and deployment of updates?
• Can the monitoring agents inject themselves into ephemeral components like functions or containers, and do configuration changes require additional manual instrumentation?
• Are the metrics coarsely sampled or high-fidelity?
• Is there enough meta-information and context to build a unifying data model?

A big airline with 2,500 hosts has 432 million topology updates per day.

² Use AIOps for a Data-Driven Approach to Improve Insights From IT Operations Monitoring Tools (Gartner Research Note)


Rich data in context

To accomplish true root-cause analysis, the collected data need to be high-fidelity (minimal or no sampling) and context-rich in order to create real-time topology and service flow maps.

Topology map

A topology map captures and visualizes the entire application environment. This includes the vertical stack (infrastructure, services, and processes) and the horizontal dependencies (i.e., all incoming and outgoing call relationships). The best monitoring solutions provide auto-discovery of new environment components and near real-time updates.

Service flow map

A service flow map offers a transactional view that illustrates the sequence of service calls from the perspective of a single service or request. A service flow map displays a step-by-step sequence of a whole transaction, whereas a topology map is a higher abstraction that shows only general dependencies. Service flows require high-fidelity data with minimal or no sampling.
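The distinction between the two maps can be sketched in a few lines: a vertical runs-on chain and a horizontal call sequence derived from the same data model. The component names and the dictionary schema below are invented for illustration.

```python
# Toy topology model with vertical (runs-on) and horizontal (calls) edges.
topology = {
    "web-app":   {"runs_on": "webserver-cluster", "calls": ["service-1"]},
    "service-1": {"runs_on": "host-a", "calls": ["service-2"]},
    "service-2": {"runs_on": "host-b", "calls": []},
}

def vertical_stack(component, topology):
    """Walk the runs-on chain below a component (topology-map view)."""
    stack = []
    node = topology.get(component, {}).get("runs_on")
    while node is not None:
        stack.append(node)
        node = topology.get(node, {}).get("runs_on")
    return stack

def service_flow(entry, topology):
    """Flatten the step-by-step call sequence from one entry point
    (service-flow view)."""
    flow, queue = [], [entry]
    while queue:
        svc = queue.pop(0)
        flow.append(svc)
        queue.extend(topology.get(svc, {}).get("calls", []))
    return flow

print(vertical_stack("web-app", topology))   # infrastructure below the app
print(service_flow("web-app", topology))     # transactional view
```

The same underlying records answer two different questions, which is why the chapter insists on context-rich data: without the runs-on and calls relationships attached to each metric, neither map can be built.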


CHAPTER 3

AI operations and root-cause analysis

Insight

Enterprises without AI attempt the impossible. As enterprises embrace a hybrid, multicloud environment, the sheer volume of data and massive environmental complexity will make it impossible for humans to monitor, comprehend, and take action.

Challenge

We are quickly entering a time when humans will no longer be the main actors who fix IT problems or push code into production. Cloud and AI solutions revolve around automation, so DevOps won’t require nearly as much human intervention in the future. For AIOps (truly autonomous cloud operations) to work perfectly, we need a system that can not only identify that something is wrong but also pinpoint the true root cause.

Modern, highly dynamic microservices architectures run in hybrid and multicloud environments. Infrastructure and services are spun up and spun down within the blink of an eye as loads demand. Determining the root cause of an anomaly requires exponentially more effort than humans can take on.

52 billion dependencies are analyzed per day to find problem root causes for a multinational business systems company with 17,500 hosts.


Root-cause analysis with deterministic AI

Davis—the Dynatrace AI engine—uses application topology and service flow maps together with high-fidelity metrics to perform a fault-tree analysis. A fault tree shows all the vertical and horizontal topological dependencies for a given alert. Consider the following example, visualized in the chart:

1. A web app exhibits an anomaly, like a degraded response time.

2. Davis first “takes a look” at the vertical stack below and finds that everything performs as expected—no problems there.

3. From here, Davis follows all the transactions and detects a dependency on Service 1 that also shows an anomaly. In addition, all further dependencies (Services 2 and 3) exhibit anomalies as well.

4. The automatic root-cause detection includes all the relevant vertical stacks as shown in the example and ranks the contributors to determine the one with the most negative impact.

5. In this case, the root cause is CPU saturation in one of the Linux hosts.

Deterministic AI automatically and accurately determines the technical root cause of an anomaly. This is a necessary precondition for true AIOps. We’ll go deeper into the requirements for auto-remediation in the next sections.

[Figure: Fault tree for the example. A web app runs on a webserver cluster and calls Service 1, which in turn depends on Services 2 and 3; each service runs on its own host, and the root cause is CPU saturation on one Linux host.]
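The five-step walk above can be sketched as a small graph traversal: follow anomalous dependencies outward and downward from the alerting entity, then rank the candidates by impact. This is a simplified illustration of fault-tree-style analysis with invented component names and scores, not the actual Davis implementation.

```python
# Toy fault-tree walk mirroring the five steps above (illustrative only).
# `anomalies` maps component -> negative-impact score (0.0 means healthy).
anomalies = {
    "web-app": 0.4, "service-1": 0.5, "service-2": 0.3,
    "service-3": 0.3, "linux-host-2": 0.9,
}
calls = {"web-app": ["service-1"], "service-1": ["service-2", "service-3"]}
runs_on = {"service-2": "linux-host-2"}

def root_cause(entry):
    """Follow anomalous dependencies from the alerting entity and return
    the contributor with the highest negative-impact score."""
    candidates, frontier = {}, [entry]
    while frontier:
        comp = frontier.pop()
        score = anomalies.get(comp, 0.0)
        if score == 0.0:
            continue  # a healthy component ends the walk on this branch
        candidates[comp] = score
        frontier.extend(calls.get(comp, []))    # horizontal dependencies
        if comp in runs_on:
            frontier.append(runs_on[comp])      # vertical stack below
    return max(candidates, key=candidates.get)

print(root_cause("web-app"))
```

Because the traversal is deterministic, the same inputs always yield the same ranked answer, which is what makes the result explainable and, ultimately, safe to act on automatically.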


Understanding problem evolution

Deterministic fault-tree analysis yields precise, explainable results. These can be used to replay the evolution and resolution of a problem step by step and visualize the affected components in a topology map. This is an extremely powerful feature because it allows the DevOps team to gain a deep understanding of the problem right from the get-go, cutting triage and research time to a minimum.

The problem evolution data is key for auto-remediation. Given that it can be accessed through application programming interfaces (APIs), remediation sequences can be triggered to resolve a problem with surgical precision and at a speed not achievable by human operators.
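A problem-evolution replay can be as simple as sorting timestamped state changes: the earliest event points at where the trouble started. The event schema and field names below are illustrative assumptions, not a specific vendor's API payload.

```python
from datetime import datetime

# Hypothetical problem-evolution events, as a notification API might
# expose them (schema is invented for illustration).
events = [
    {"ts": "2023-11-28T07:03:00", "component": "web-app",      "state": "SLOW"},
    {"ts": "2023-11-28T06:58:00", "component": "linux-host-2", "state": "CPU_SATURATED"},
    {"ts": "2023-11-28T07:01:00", "component": "service-2",    "state": "SLOW"},
    {"ts": "2023-11-28T07:54:00", "component": "linux-host-2", "state": "RESOLVED"},
]

def replay(events):
    """Replay a problem's evolution in chronological order, so the first
    line shows where the trouble started and the last how it ended."""
    ordered = sorted(events, key=lambda e: datetime.fromisoformat(e["ts"]))
    return [f'{e["ts"]} {e["component"]}: {e["state"]}' for e in ordered]

for line in replay(events):
    print(line)
```

The replay immediately shows the host saturating three minutes before the user-facing slowdown, which is exactly the kind of ordering a triage call would otherwise spend its first hour reconstructing.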


CHAPTER 4

Impact analysis and foundational root causes

Insight

Infrastructure and services get spun up and spun down as needed at mind-boggling speed in a modern dynamic microservice application. That’s the nature of a healthy system.

A disappearing container can be a desired event to optimize resources, or it can be a sign of an unintended disruption that requires immediate mitigation. The AI needs to be able to tell an anomaly from a desired change.

Challenge

A precise and reliable determination of the technical root cause is absolutely essential for auto-remediation, but it is not sufficient. We also need a measure of an anomaly’s severity and some indication of what led to the technical root cause in the first place.

Impact severity

Not every disappearing container or host is a problem, and a slow service that nobody uses does not require immediate attention. Therefore, an advanced software intelligence system assesses the severity of a problem:

User impact
How many users have been impacted by a detected problem since it occurred? Ideally, the number should be based on actual real users rather than a statistical extrapolation of historic data.

Service calls impacted
Some parts of the system are not built for human interaction. In this case, the number of affected service calls is a good estimate of the severity.

Business impact
As software intelligence solutions increasingly cover enterprise systems end-to-end, from user actions all the way to the infrastructure, it is possible to map system performance to business key performance indicators (KPIs). A retailer, for example, can measure the dollar value of purchases during a system slowdown and compare it with a reference time frame in the past.
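The three severity measures above can be combined into a single score for prioritization. The sketch below is one possible weighting; the normalization thresholds and weights are arbitrary assumptions made for illustration, not product behavior.

```python
def severity(impacted_users, affected_calls, revenue_delta_pct,
             weights=(0.5, 0.2, 0.3)):
    """Toy severity score combining user impact, service-call impact,
    and business impact. Each input is normalized to 0..1 before
    weighting (thresholds are illustrative assumptions)."""
    u = min(impacted_users / 1000, 1.0)        # 1,000+ users -> saturated
    c = min(affected_calls / 100_000, 1.0)     # 100k+ calls -> saturated
    b = min(abs(revenue_delta_pct) / 20, 1.0)  # 20%+ revenue swing -> saturated
    wu, wc, wb = weights
    return round(wu * u + wc * c + wb * b, 3)

# A slow service nobody uses vs. a revenue-impacting outage.
quiet  = severity(impacted_users=3, affected_calls=40, revenue_delta_pct=-0.1)
outage = severity(impacted_users=1170, affected_calls=384_000, revenue_delta_pct=-17.2)
print(quiet, outage)
```

A scheme like this makes the chapter's point operational: the quiet anomaly scores near zero and can wait, while the outage saturates the scale and justifies immediate, possibly automated, action.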


Foundational root causes

The technical root cause determines what is broken. The foundational root cause specifies why it is broken.

Typical foundational root causes are the following:

• Deployments: Collecting metrics and events from the CI/CD tool chain makes it possible to link a problem to a specific deployment (and roll it back if needed).

• Third-party configuration changes: These can relate to changes in the underlying cloud infrastructure or a third-party service.

• Infrastructure availability: In many cases, the shutdown or restart of hosts or individual processes causes the problem.

To determine the foundational root causes, the AI engine needs to have access to metrics and events from the CI/CD pipeline, ITSM solutions, and other connected tools. Dynatrace provides an API and plug-ins to ingest third-party data into Davis.

[Screenshot: Problem 753, “User action duration degradation,” detected Nov 28 06:58–07:54 (open for 56 minutes) and affecting real users: 2 applications, 15 services, and 3 infrastructure components, with 654,998,400 discrepancies analyzed. The business impact analysis shows 1.17k impacted users and 384k affected service calls during the first 10 minutes. The business metric analysis compares Basket (51,782), Checkout (379), and Order Details (16) counts against yesterday and last week. The root-cause panel traces all incidents to one cause: a response time degradation on the Check Destination custom service (current response time 19.6 s exceeds the auto-detected 120 ms baseline by 16,309%, 551 requests/min, all methods affected), caused by CPU saturation (100% CPU usage) on one host.]
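Linking a problem to a deployment, the first foundational root cause listed above, amounts to checking whether a recent CI/CD event touched the affected service. The sketch below is a minimal version of that check with invented event data; a real integration would pull these events from the CI/CD tool chain instead.

```python
from datetime import datetime, timedelta

# Hypothetical CI/CD deployment events and a detected problem.
deployments = [
    {"service": "checkout", "version": "v41", "at": datetime(2023, 11, 28, 6, 50)},
    {"service": "search",   "version": "v12", "at": datetime(2023, 11, 27, 15, 0)},
]
problem = {"service": "checkout", "detected_at": datetime(2023, 11, 28, 6, 58)}

def foundational_root_cause(problem, deployments, window=timedelta(hours=1)):
    """If a deployment to the affected service landed shortly before the
    problem was detected, flag it as the likely foundational root cause
    (and hence a rollback candidate)."""
    for d in deployments:
        if (d["service"] == problem["service"]
                and timedelta(0) <= problem["detected_at"] - d["at"] <= window):
            return d
    return None

suspect = foundational_root_cause(problem, deployments)
print(suspect["version"] if suspect else "no recent deployment")
```

Here the checkout problem surfaces eight minutes after the v41 deployment, so v41 is flagged; the stale search deployment from the previous day is correctly ignored.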


CHAPTER 5

Auto-remediation

Insight

Infrastructure as code and powerful cloud orchestration layers provide the necessary ingredients to automate operations and enable self-remediation. This will not only reduce operational cost but also avoid human error. The key to truly autonomous cloud operations is reliable system health information, including deep anomaly root-cause and impact analyses.

Challenge

Many cloud platforms offer mechanisms to restart unhealthy hosts and services or dynamically adjust resources based on load demand. Some of these solutions are very advanced—however, they work only within their designed scope. Software intelligence solutions cover the entire enterprise system end to end, including hybrid environments where mainframes exist alongside multiple cloud platforms.

Enabling auto-remediation

There are many ways of implementing auto-remediation in practice. Typically, the software intelligence platform integrates with CI/CD solutions or with cloud platform configuration layers to execute remediation actions. In any case, the software intelligence solution needs to provide full-stack monitoring, automatic anomaly detection, precise root-cause analysis, and problem notification through APIs.

[Diagram: Observability and AIOps platform—the full-stack environment is monitored, anomalies are detected automatically, root-cause analysis is performed, and a problem notification is sent. CI/CD automation—the event is received, a job is triggered, a playbook is executed, and the problem is remediated.]
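The handoff in the diagram, from problem notification to triggered playbook, can be sketched as a small dispatcher that maps a notified root cause to a remediation job. The payload schema and playbook names below are invented for illustration; they are not a specific vendor's webhook format.

```python
import json

# Hypothetical problem-notification payload, as a monitoring platform
# might POST it to a remediation webhook (schema is illustrative).
notification = json.dumps({
    "problem_id": "P-753",
    "root_cause": {"type": "CPU_SATURATION", "entity": "linux-host-2"},
    "impact": {"users": 1170, "service_calls": 384000},
})

PLAYBOOKS = {
    "CPU_SATURATION": "scale-out-instances",
    "MEMORY_LEAK": "restart-process",
}

def on_problem(payload: str) -> str:
    """Map the notified root cause to a remediation playbook; unknown
    causes fall through to a human escalation path."""
    problem = json.loads(payload)
    cause = problem["root_cause"]["type"]
    playbook = PLAYBOOKS.get(cause, "escalate-to-oncall")
    # A real integration would trigger the job in a CI/CD or workflow tool here.
    return f'{problem["problem_id"]}: run {playbook} on {problem["root_cause"]["entity"]}'

message = on_problem(notification)
print(message)
```

Note the design choice the chapter implies: the dispatcher only acts on causes it explicitly knows, and everything else escalates to a human, because automation is only as safe as the root-cause determination behind it.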


Path forward with AI and automation: auto-remediation and self-healing

Complex auto-remediation sequences

This example shows how a precise analysis of the technical root cause, foundational root causes, and user/business impact can be used to automate problem resolution through integration with a variety of CI/CD, ITOM, workflow, and cloud technologies.

[Diagram: An auto-mitigation workflow for www.easytravel.com (325 impacted users), with a problem-evolution chart from 08:00 to 08:45 and the following decision points:
1. CPU exhausted? Add a new service instance.
2. High garbage collection? Adjust or revert memory settings.
3. Issue with BLUE only? Switch back to GREEN.
4. Hung threads? Restart the service, update dev tickets, and mark bad commits.
Impact mitigated? 5. If still ongoing, initiate a rollback. Still ongoing after that? Escalate, even at 2 a.m.]
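The decision tree above is essentially an escalating sequence: try the cheapest remediation first, check whether the impact is mitigated, and escalate to a human only if nothing works. The sketch below illustrates that control flow with a toy problem state; the actions and the mitigation check are invented stand-ins for real playbook steps.

```python
def remediate(problem, actions, is_mitigated):
    """Run an escalating remediation sequence, stopping at the first
    action that mitigates the problem; escalate if none of them do."""
    for action in actions:
        action(problem)
        if is_mitigated(problem):
            return action.__name__
    return "escalate"

# Toy remediation actions mutating a shared problem state.
def add_instance(p): p["instances"] += 1
def revert_memory_settings(p): p["heap_mb"] = p["last_good_heap_mb"]
def rollback(p): p["version"] = p["last_good_version"]

problem = {"instances": 2, "heap_mb": 512, "last_good_heap_mb": 1024,
           "version": "v41", "last_good_version": "v40"}

# Toy mitigation check: the problem clears once capacity is back.
mitigated = lambda p: p["instances"] >= 3 or p["version"] == p["last_good_version"]

result = remediate(problem, [add_instance, revert_memory_settings, rollback], mitigated)
print(result)
```

In this run the first, least disruptive action already mitigates the impact, so the rollback never fires, which mirrors the "impact mitigated?" checkpoints in the workflow above.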


CHAPTER 6

Automation and system integrations

Insight

Automation doesn’t stop at software operations and auto-remediation in an enterprise-grade application environment. Accurate and explainable software intelligence has the capacity to move toward automating the entire digital value chain and to enable novel business processes.

The unbreakable DevOps pipeline

Over the last years, many DevOps teams have come a long way in implementing a CI/CD pipeline that codifies and automates parts of the build, testing, and deployment steps. The goal is to speed up time to market and ensure excellent software quality—to get faster and better. AI-powered software intelligence helps to close existing automation gaps like manual approval steps at decision gates or build validation. It also provides valuable performance signatures to test new builds against production scenarios. This follows the concept of “shift left”—using more production data earlier in the development lifecycle to answer the question: “Is this a good or bad change before we push it into production?”

3x faster build and test cycles, 50% reduction in issues. —Verizon Enterprise

[Diagram: A pipeline from Continuous Integration (CI) to Continuous Delivery (CD)—a check-in auto-triggers the build, and AI-powered quality gates govern promotion through the build, dev, beta, and production stages.]
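An AI-powered quality gate of the kind shown in the pipeline boils down to comparing a build's performance signature against a production baseline and blocking promotion on regressions. The metrics, thresholds, and tolerance below are illustrative assumptions, not a documented gate configuration.

```python
def quality_gate(build_signature, prod_baseline, tolerance=0.10):
    """Pass a build only if each performance metric stays within
    `tolerance` of the production baseline (higher values = worse here)."""
    violations = []
    for metric, baseline in prod_baseline.items():
        value = build_signature.get(metric)
        if value is None or value > baseline * (1 + tolerance):
            violations.append(metric)
    return ("pass", []) if not violations else ("fail", violations)

# Hypothetical production signature and two candidate builds.
prod = {"p95_response_ms": 120, "error_rate_pct": 0.5, "cpu_pct": 60}
good_build = {"p95_response_ms": 118, "error_rate_pct": 0.4, "cpu_pct": 63}
bad_build  = {"p95_response_ms": 190, "error_rate_pct": 0.4, "cpu_pct": 61}

print(quality_gate(good_build, prod))  # promoted to the next stage
print(quality_gate(bad_build, prod))   # blocked at the gate
```

Because the gate returns the specific violated metrics rather than a bare pass/fail, the pipeline can annotate the build with why it was blocked, which is what replaces the manual approval step.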


Automating customer service

Any good software intelligence solution needs to include real-user data. Organizations can use an impact analysis (as described in chapter 4) to ensure customer satisfaction even if something goes wrong.

In case of a breakdown or slowdown, the system can engage autonomously with impacted users. One way is to open a chat window operated by a chatbot behind the scenes and inform the customer about the specific performance issue, then offer to make it up to them by providing discounts or other compensation.

[Chat mockup: Mary: “Hi Dirk! We want to apologize for the poor performance of our website today. We’re working under high pressure to fix this. Thank you for your patience.” Dirk: “Thank you!”]


AI and automation based on holistic observability data hold the key to true AIOps.
We hope this ebook has inspired you to take the next step in your digital journey. Dynatrace is committed
to providing enterprises the data and intelligence they need to be successful with their enterprise cloud
and digital transformation initiatives, no matter how complex.

Learn more

If you are ready to learn more, please visit www.dynatrace.com/platform for assets, resources, and a free 15-day trial.

Dynatrace (NYSE: DT) exists to make the world’s software work perfectly. Our unified platform combines broad and deep observability and continuous runtime application security with the most advanced AIOps to provide
answers and intelligent automation from data at enormous scale. This enables innovators to modernize and automate cloud operations, deliver software faster and more securely, and ensure flawless digital experiences.
That’s why the world’s largest organizations trust the Dynatrace® platform to accelerate digital transformation.
Curious to see how you can simplify your cloud and maximize the impact of your digital teams? Let us show you. Sign up for a free 15-day Dynatrace trial.

blog @dynatrace

09.12.23 BAE7944_EBK_cs
