
How to Manage AI Performance
Full Lifecycle AI Observability vs. Production-Only and KPI Monitoring

WHITEPAPER

Table of Contents

Executive Summary
New Solutions are Required for Managing AI Application Performance
Challenges in Managing AI Application Performance
Methods for Managing AI Application Performance
Full Lifecycle Observability: Easy, Effective ML Performance Management
Key Full Lifecycle Capabilities for AI Quality
Advantages of Full Lifecycle Observability for AI Quality

Executive Summary
As AI applications become increasingly important to business or organizational success, achieving and maintaining high model performance is paramount. Monitoring, debugging, and testing models in a systematic, comprehensive way is essential.

There are three options for managing ML system performance:
1. KPI and infrastructure monitoring, combined with model retraining
2. Production-only observability, and
3. Full lifecycle AI observability.

Business KPI and infrastructure monitoring alone create an AI Observability Gap
Monitoring business KPIs and infrastructure in conjunction with regular retraining is where most ML teams start. However, it leaves an observability gap by not tracking the ML model inputs, outputs, and performance metrics. This approach does not address all five AI application failure modes, including the common AI model challenges of data and concept drift.

Production-only ML Observability is helpful, but not enough
Production-only ML observability software provides advantages over business KPI- and infrastructure-only monitoring by tracking model input, output, and performance metrics. However, tracking model metrics is not enough. ML teams need to be able to identify the root causes of model failures, direct the retraining of models, and test models to validate that the retrained model has addressed the model failure and not created new model failure regressions.

Full lifecycle AI Observability is the most effective option
Full lifecycle AI observability software provides the most effective solution for managing AI application quality by not only addressing the AI observability gap but also by providing the root cause analysis, evaluation, and testing capabilities needed to resolve model failures and improve AI application performance. By tracking more comprehensive quality metrics, explaining AI models, and enabling responsible AI, full lifecycle AI observability software can address all five model failure modes and enable ML teams to build and maintain higher quality AI applications.


New Solutions are Required for Managing AI Application Performance

Artificial Intelligence and Machine Learning (AI/ML) systems are increasingly important to achieving an enterprise's business Key Performance Indicators (KPIs), such as revenue, EBITDA, or new customers. As a result, improving AI application performance can result in substantial revenue or cost improvements.

Managing the performance and quality of these systems, however, is challenging. ML models are complex, hard to explain, and difficult to evaluate. They can fail in a number of different ways - due to model performance, KPI performance, data quality, operational factors, and societal factors. ML model performance is susceptible to drift, particularly concept and data drift.

Systematically managing these dynamics requires a new set of technology capabilities and approaches. In practice, ML teams typically have three primary approaches for managing ML system performance: (1) KPI monitoring and model retraining, (2) ML production monitoring/observability, and (3) full lifecycle AI observability and AI quality management.

This whitepaper will cover the unique challenges of managing AI application performance, the typical failure modes, and the three primary approaches for managing system performance, along with their strengths and weaknesses. If you are just starting your AI journey or are well along your way, we hope that this whitepaper will help you to understand how to drive better performance through the key tenets of AI observability: monitoring, debugging, and testing.


Challenges in Managing AI Application Performance
In order to ensure that ML applications meet their objectives, ML teams
need to be aware of the potential ways in which these applications can
fail. A ML application can fail at multiple layers, and these failures will be
captured by different metrics. These failure scenarios include:
1. Model failures, typically best measured by model performance metrics
2. KPI performance failures, typically best measured by business KPI metrics
3. Data quality issues
4. Failures due to operational factors
5. Failures in societal factors

Any approach to managing the performance of a ML system needs to consider all of these failure scenarios in order to be effective. These factors also represent the levers that ML teams can use to improve the performance of their ML application.



1. Model Performance Failures

The most important determinant of a ML system's success is the ML model at its heart. Most traditional ML models that predict a data label are trained to achieve certain model accuracy or error metrics, such as Area Under the Curve (AUC), precision, recall, Root Mean Square Error (RMSE), and more. ML models are famously susceptible to declines in these metrics after they are pushed to production. This is because ML models are initially trained on a static set of data and labels from a certain period in time; however, the dynamics determining the predicted label in the real world are constantly changing. Specifically, ML models are known to be susceptible to data and concept drift.

Data Drift occurs when the statistical properties of the live input data in production are different from those of the training data used to develop the model, often leading to inaccurate predictions. Data drift can occur as soon as a model goes into production and is sometimes referred to as "training serving skew." An example of this skew would be a model developed to forecast grocery product demand. If it is trained mostly on data from the east coast but then used to predict demand across the country, it could experience data drift issues immediately.

Data drift also occurs over time. For example, recently developed credit decisioning models would have been trained during a period of low interest rates. If the model was not adequately trained on high interest loan data, the model's performance would have undoubtedly declined as the Federal Reserve rapidly increased interest rates during 2023.
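How far a feature has drifted can be quantified with standard distribution-distance measures such as the Population Stability Index (PSI). The sketch below is a minimal, generic illustration, not TruEra's implementation; the interest-rate example data and the 0.2 rule-of-thumb threshold are assumptions made for the example.

```python
import numpy as np

def population_stability_index(train_vals, prod_vals, n_bins=10):
    """Quantify data drift for one numeric feature with PSI.

    Bins come from training-data quantiles; PSI compares the share of
    training vs. production rows that fall into each bin.
    """
    edges = np.quantile(train_vals, np.linspace(0, 1, n_bins + 1))[1:-1]  # interior edges
    train_counts = np.bincount(np.searchsorted(edges, train_vals), minlength=n_bins)
    prod_counts = np.bincount(np.searchsorted(edges, prod_vals), minlength=n_bins)

    # Clip to avoid log(0) for empty bins
    train_frac = np.clip(train_counts / len(train_vals), 1e-6, None)
    prod_frac = np.clip(prod_counts / len(prod_vals), 1e-6, None)
    return float(np.sum((prod_frac - train_frac) * np.log(prod_frac / train_frac)))

# Illustrative example: interest-rate drift in a credit model (synthetic data)
rng = np.random.default_rng(0)
train_rates = rng.normal(3.0, 0.5, 10_000)   # trained during a low-rate period
prod_rates = rng.normal(6.5, 0.8, 2_000)     # scored during a high-rate period

print(f"PSI = {population_stability_index(train_rates, prod_rates):.2f}")
# A PSI above ~0.2 is a common rule of thumb for significant drift.
```

Categorical features can be handled the same way, with one bin per category.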



1. Model Performance Failures, continued

Concept Drift occurs when the relationship between input features and the ML target changes. This phenomenon typically happens over time and often leads to lower performance. In recent years, the Covid-19 pandemic, supply chain disruptions, inflationary pressures, and the Ukraine war have provided a plethora of examples of concept drift.

For example, models used to predict the probability of auto accidents have experienced significant drift due to changes in driving patterns during the Covid-19 pandemic emergency. Prior to the pandemic emergency, features related to driving time and traffic patterns would have been most important to predicting accident rates. During the Covid lockdowns, these models would have predicted massive declines in accidents, due to the sharp decline in traffic volume. This did not occur, however. While accidents did decline somewhat, they did not decline nearly as much as expected. Instead of being influenced by driving time and traffic, accidents were caused by an increase in unsafe driving habits, such as young drivers driving at high speed on empty roads. In addition to changes in the relationship between driving time features and the probability of accidents, there was a change in the severity of these accidents, as the high speed accidents were more severe than accidents prior to the pandemic.

Types of drift

| Type of Drift | Description | When it occurs | Impact on Performance | Example |
|---|---|---|---|---|
| Data | Change in production input data relative to training data | Over time | Decline or improve* | Increase in interest rates in a credit model |
| Training Serving Skew | Change in production input data relative to training data | At launch | Decline** | Use of forecast model on geographies not in training data |
| Concept | Change in relationship between the input features and the model target | Over time | Decline** | Relationship between highway driving time and number of accidents during pandemic lockdowns |

* Data drift will typically produce a change in scores (score drift). Often this reduces model performance, but there are situations when data drift can improve model performance. For example, take a credit model with training data with more low interest loans than high. If interest rates were to decline after a period of time when interest rates were higher than the training data average, then this would likely improve model performance.
** Training serving skew and concept drift can increase or decrease model scores, but they typically produce negative performance (e.g., accuracy or error) drift.
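To make the distinction between data drift and concept drift concrete, here is a toy, self-contained illustration (not taken from the whitepaper): the input distribution is identical in both periods, but the relationship between a driving-time feature and the accident label changes, so a model trained on the earlier period degrades even though no data drift would be flagged.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def make_period(n, weight):
    """Same input distribution each period; only the feature-target
    relationship (the 'concept') changes via `weight`."""
    driving_time = rng.normal(1.0, 0.3, n)            # hours driven per day
    logits = weight * driving_time - 1.0
    accident = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)
    return driving_time.reshape(-1, 1), accident

# Pre-lockdown: more driving time -> more accidents
X_train, y_train = make_period(5_000, weight=2.0)
# Lockdown period: relationship reverses (unsafe driving on empty roads)
X_prod, y_prod = make_period(5_000, weight=-2.0)

model = LogisticRegression().fit(X_train, y_train)
print("accuracy, training period:", round(model.score(X_train, y_train), 2))
print("accuracy, drifted period :", round(model.score(X_prod, y_prod), 2))
# Inputs are distributed identically (no data drift), yet accuracy collapses:
# this is concept drift, which blindly appending data and retraining will not fully fix.
```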

2. KPI Performance Failures

ML models can also fail to meet KPIs due to factors outside of the model itself. For example, consider ML ranking models for search. The ranking model may continue to perform by providing relevant results that users want to click on, but business KPIs such as conversion rate or revenue per search can still decline. This can occur for a variety of reasons. The marketing team, for example, could reduce their paid search budget, decreasing higher-converting site visitors, which in turn reduces the overall conversion rate of search sessions. Even though the search engine is performing at the same level (i.e., it is correctly identifying products that the user wishes to see), the KPIs of the site's search system, such as conversion rate and revenue per search, can still decline.

3. Data Quality Issues

The quality of production input data can significantly impact model and business KPI performance. Low quality data, such as null, unexpected, or out-of-bounds values, can result in reduced model accuracy. This is especially true for models dependent on data that can be difficult to consistently process (such as IoT data) or on third-party data. For example, a retail product forecast model would significantly underperform if a third party providing product-related data, such as popularity or pricing information, started sending incorrect data for certain products.
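Data quality checks of this kind can be run on every scoring batch before it reaches the model. The following is an illustrative sketch only; the column names and bounds are hypothetical and would normally be derived from the training data profile.

```python
import pandas as pd

# Expected columns and plausible bounds for a retail demand-forecast model
# (hypothetical values used purely for illustration).
EXPECTED_BOUNDS = {
    "price": (0.01, 10_000.0),
    "popularity_score": (0.0, 1.0),
    "units_sold_last_week": (0, 1_000_000),
}

def data_quality_report(batch: pd.DataFrame) -> dict:
    """Flag missing, unexpected, or out-of-bounds values in a scoring batch."""
    report = {}
    for col, (lo, hi) in EXPECTED_BOUNDS.items():
        if col not in batch.columns:
            report[col] = {"status": "missing column"}
            continue
        values = batch[col]
        report[col] = {
            "null_rate": float(values.isna().mean()),
            "out_of_bounds_rate": float(((values < lo) | (values > hi)).mean()),
        }
    return report

# Illustrative usage on a small synthetic batch
batch = pd.DataFrame({
    "price": [19.99, None, -5.0],           # a null and a negative price
    "popularity_score": [0.4, 0.9, 1.7],    # 1.7 is out of bounds
    "units_sold_last_week": [120, 300, 80],
})
print(data_quality_report(batch))
```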

4. Failures Due to Operational Factors

ML models are often not the only component of a ML application system. The success of a ML application can also be dependent on the people and processes related to the application, not just the software and the ML model. For example, the business performance of a ML system may also be dependent on a human taking a manual action based on a model prediction. If a model is not trusted or explainable, then the human may ignore the model's prediction or not take the best next action based on the prediction. For example, consider a predictive maintenance model that predicts the likelihood of a machine failure. If the ML model does not provide the reasons for the machine failure prediction, such as the machine characteristics driving the prediction (e.g., vibration or temperature outside of acceptable bounds), then the engineer will not necessarily trust the prediction or know which potential component to replace.

5. Failures Due to Societal Factors

ML applications can also fail to achieve goals if they do not adhere to societal norms or regulations, such as those related to fairness or safety. Much has been written about the risk of model bias in employment, credit, and advertising decision-making. If a model achieves high predictive accuracy but is proven to be biased, it cannot be regarded as a success. Recent legislation has increased the need for testing, monitoring, and reporting on ML model performance for societal factors. A New York City AI law regulating the use of automated employment decision tools by employers went into effect in 2023. The "Protecting Consumers from Unfair Discrimination in Insurance Practices" law in Colorado holds insurers accountable for ensuring that their AI systems are not unfairly discriminating against consumers. Regulators, such as the OCC, CFPB, FTC, and FDA, have adopted guidelines governing AI use. New laws, such as the EU's Artificial Intelligence Act, and similar laws in Canada, Brazil, and the UK, are also under consideration. All of these laws essentially require AI developers to regularly explain and audit the quality (e.g., fairness, safety, and robustness) of their AI systems.


Methods for Managing AI Application Performance

Any team developing AI applications needs to thoughtfully manage the performance of the application in order to minimize these failure scenarios and achieve their business KPIs. Three common approaches that ML teams implement include:

1. KPI and infrastructure monitoring, combined with model retraining
2. Production-only monitoring and observability
3. Full lifecycle AI observability and AI quality management

These approaches are not mutually exclusive. In fact, many ML teams start with measuring and monitoring application KPIs, along with some level of manual or automated model retraining, as their initial approach to managing ML application performance. KPI-only monitoring, however, often makes it impossible to effectively manage the performance and quality of ML applications. As teams realize this, they will typically make an investment in some form of ML observability. This section covers the benefits and limitations of each approach.



KPI, Infrastructure Monitoring, and Model Retraining - Essential, but Watch Out for the AI Observability Gap
Measuring the KPIs that the ML system is intended to impact, such as revenue, profitability, new customers, or upsells, is essential to the success of the project. If the application doesn't improve organizational KPIs, then it likely won't stay in production for long. KPI monitoring can rapidly identify many KPI failure scenarios. ML teams also tend to already have software that can measure KPIs, such as data warehousing and business intelligence (BI) systems. Similarly, DevOps teams are used to monitoring application infrastructure metrics, such as latency, CPU utilization, and storage and memory utilization, with their existing application performance monitoring and observability tools. These tools can effectively be used to keep AI application infrastructure performant and available, as with other applications.

However, KPI and infrastructure monitoring, while essential, is not sufficient. It creates an "AI Observability Gap" in the overall AI application performance monitoring stack. As will be covered in detail in the following section, it only addresses one of the AI application specific failure scenarios and cannot address the other four: ML model performance, data quality, operational factor, or societal factor failure scenarios.


KPI Monitoring and Model Retraining, continued

The problem with only measuring KPIs is that it creates an AI Observability Gap with respect to ML model output and inputs. While the KPIs are being analyzed by the business intelligence system and the infrastructure is being monitored, there is nothing paying attention to the model data and the model performance metrics. This means that the organization is blind as to what may be going wrong to negatively impact the business KPIs.

Figure 1: Challenges Created by the AI Observability Gap. In the performance monitoring stack - business KPIs, model metrics, model data, and infrastructure - only the KPI and infrastructure layers are watched, leaving an AI Observability Gap across the model metrics and model data layers. The consequences include limited performance visibility and missed issues not surfaced in KPIs, lower model performance (data quality, missed improvement opportunities), unworkable, time consuming, and unresolved debugging, unexplained model behavior with lower stakeholder trust and collaboration, and Responsible AI risks. KPI-only monitoring leaves an organization blind to a significant part of the ML stack.



KPI Monitoring and Model Retraining, continued

This gap leads to lower visibility around model performance and the causes of model performance issues. Since KPIs can be influenced by many factors other than the ML model, lack of model performance visibility can make debugging KPI performance and attributing it to elements of ML models exceedingly difficult without additional ways of monitoring. How does the ML team identify when a KPI decline is due to the model or another non-model element of the application?

For example, an ecommerce recommendation model may appear to be driving significantly fewer conversions versus prior periods. However, like the search example we discussed earlier, this period of lower performance could be driven more by the marketing team significantly reducing their paid search marketing spend - which normally brings in higher-converting traffic - than by any change in ML model performance. When there is model performance observability, situations like this can be easily identified.

Without AI observability, understanding the drivers of KPI changes is also challenging, because KPI changes often lag. ML models can start to underperform, and it may be some time before this is materially reflected in, for example, a revenue KPI. By then, however, serious damage could already be done. Without AI observability, the time lag makes it hard to determine whether the root cause of the KPI change is the model or other factors that could be occurring at the same time the KPI change is finally observed.



KPI Monitoring and Model Retraining, continued

The Limits of Automated Retraining


In addition to KPI monitoring, many ML teams find that an easy way to manage ML system performance is to retrain models. Model retraining can help address some kinds of data drift. ML teams will often schedule or automate model retraining.

However, automated retraining has limitations. This approach is often insufficient to address concept drift or sharper, larger, or seasonal data drift challenges. In these situations, automatically appending new data and retraining will produce an averaging effect that won't fully adjust to concept or data drift.

Situations where automated retraining is insufficient include:

Sudden shifts in data relationships
Automated retraining would, for example, have been insufficient to handle the large change in driving behavior in the auto accident prediction model discussed earlier. Training a model based on pre-lockdown and post-lockdown data would not produce the best model for post-lockdown behavior.

Seasonality
Automated retraining also isn't optimal for seasonal patterns. For example, ecommerce behavior can change significantly during sale or holiday periods. Automatically retraining models across these periods produces models that don't perform well when a sale or holiday begins and then don't perform well when a sale or holiday ends. For many AI use cases, retraining needs to be more directed versus blindly automated.



KPI Monitoring and Model Retraining, continued

Data quality challenges
Model retraining is of little help with data quality issues - for example, when the process used to extract, transform, and load data from one or more sources into the ML model doesn't work correctly.

Operational factors
KPI monitoring and retraining also do not address other key failure
scenarios, including operational challenges such as model
explainability or robustness.

Bias situations
Bias issues are often difficult to address with simple retraining alone. If
the bias is caused by data drift, retraining could simply reinforce bias
rather than reducing it.



Production-Only Monitoring and Observability - Valuable, but Inadequate
Another approach to managing model performance is to implement ML
monitoring or observability software that is focused on the
performance of models in production only. This software tracks model
metrics of live models, calculates drift in these metrics, and then fires
off alerts to notify ML teams of a potential issue. These systems can
help address the limitations of KPI monitoring and non-directed
retraining by helping to fill the Observability Gap.



Production-Only Monitoring and Observability, continued

Specifically, the benefits of production-only monitoring and observability include:

Continuous monitoring of model performance metrics
Tracking model performance metrics, such as accuracy and error metrics (e.g., RMSE, AUC, precision, recall), on a consistent time series basis (e.g., hourly, daily, weekly).

Continuous monitoring of data drift
Calculating drift of ML model output (e.g., scores) and performance metrics relative to a baseline on a consistent time series (e.g., hour, day, week) basis.

Continuous monitoring of data quality
Tracking the quality of data, such as null values or inaccurate input data, on a consistent time series (e.g., hourly, daily, weekly) basis.

Alerting
The ability to set thresholds on any metric and send an alert when the threshold is crossed.

Time analytics and some directional debugging information
The ability to calculate metrics for a period of time when an issue has been detected relative to a comparison period, sometimes known as "before and after" analysis, and to use this variance to make educated guesses as to the causes of KPI or model performance issues.

These capabilities, in conjunction with KPI monitoring, infrastructure monitoring, and some level of retraining, can help address the where and the when of KPI, model performance, and data quality failure scenarios. They can enable ML teams to more precisely identify whether there is a problem and where in the AI performance stack the problem exists (e.g., the data quality, model input, model performance metrics, or KPI layer), including whether the ML production model is experiencing concept drift, data drift, and/or data quality issues. This information can also be used to make model retraining more directed from a timing perspective.

Directly monitoring ML models in production makes it far easier to identify quickly when things are going wrong and can help provide some direction in debugging. However, while these capabilities are an improvement over KPI monitoring, they still only solve part of the performance management problem.
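As a rough illustration of the metric-tracking and alerting capabilities listed above (a generic sketch, not any specific product's API), the snippet below computes a daily accuracy metric and raises an alert when it falls more than a chosen tolerance below a baseline:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Alert:
    metric: str
    period: str
    value: float
    threshold: float

def daily_accuracy(records):
    """records: iterable of (prediction, label) pairs observed on one day."""
    return mean(1.0 if pred == label else 0.0 for pred, label in records)

def check_threshold(metric_name, period, value, baseline, max_relative_drop=0.10):
    """Alert when the metric falls more than `max_relative_drop` below the baseline."""
    threshold = baseline * (1 - max_relative_drop)
    if value < threshold:
        return Alert(metric_name, period, value, threshold)
    return None

# Illustrative usage with a synthetic daily batch
baseline_accuracy = 0.91   # e.g., accuracy measured at model validation
today = [(1, 1), (0, 1), (1, 1), (0, 0), (1, 0), (0, 1)]   # (prediction, label) pairs
alert = check_threshold("accuracy", "2023-06-01", daily_accuracy(today), baseline_accuracy)
if alert:
    print(f"ALERT: {alert.metric} on {alert.period} = {alert.value:.2f} "
          f"(below threshold {alert.threshold:.2f})")
```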



Production-Only Monitoring and Observability, continued

Limits of Production-Only Monitoring and Observability

Due to the unique nature of ML models, production-only observability is ineffective for managing the performance and quality of AI applications. It falls short in two significant ways.

1. Observability across the full model lifecycle
While production-only observability solutions can help identify KPI, model performance, and data quality failure scenarios, they provide limited capabilities to actually fix these problems. These solutions do not enable teams to effectively and efficiently perform Root Cause Analysis (RCA) of sub-optimal AI performance and quality. Without an understanding of root causes, ML teams cannot effectively direct model training to address the failure scenarios. Production-only observability also cannot test models to validate model quality improvements and prevent quality regressions. In other words, production-only observability enables AI monitoring but doesn't support the other key stages within the AI development lifecycle: root cause analysis, retraining, and testing/evaluation.



Production-Only Monitoring and Observability, continued

Figure 2: Production-Only Observability in the ML Lifecycle. The Machine Learning (ML) lifecycle spans Development (Train, Eval/Test), Deployment (Release), and Production Observability (Monitor, Analyze, RCA/Retrain). Production-only observability provides monitoring capabilities, but lacks critical capabilities for debugging, retraining, evaluation, and testing.

2. Evaluating AI quality
Production-only observability solutions also tend to ignore operational and societal failure scenarios. They focus on monitoring of model performance metrics such as accuracy (e.g., AUC, precision, recall) or error (e.g., MAPE or RMSE) metrics. Production-only solutions often provide inaccurate or overly general explainability, which makes it difficult to identify sources of failure and to debug them. They also often lack model/segment analytics, bias monitoring, root-cause analysis, risk management, and auditing capabilities to address operational and societal requirements.


Full Lifecycle Observability: Easy, Effective ML Performance Management
Full lifecycle observability offers complete monitoring, debugging, and testing across the entire model lifecycle. It provides two key
capabilities that are lacking in production-only observability systems:
advanced root cause analysis and comprehensive testing, evaluation,
explainability, and reporting. These capabilities make it faster and
easier to identify and address emerging model issues, saving critical
time and resources while ensuring that business KPIs are met.



Full Lifecycle Observability, continued

Advanced Root Cause Analysis

Root Cause Analysis (RCA) is the most important part of any observability solution. It is critical to identify why a problem is occurring and how to retrain models to address the problem. Monitoring AI is of limited value without RCA. Many production monitoring tools provide limited capabilities to identify features or segments driving errors, drift, overfitting, bias, unsound behavior, or comparative model performance. Some tools attempt to address this challenge and have defined themselves as observability tools. However, these tools essentially provide model and feature stats to help perform before-and-after analyses, but they leave the actual identification of root causes to guesswork based on observing potential features or segments that might be correlated to the drift.

The analytics provided in production-only observability systems have several shortcomings, including:

Not identifying root causes.
There are a number of situations where these guesses can be wrong, as features may drift but not be the primary root causes of a change in model performance.

Frequently identifying unimportant features.
Some features that drift might not be important enough to influence the model performance change. This could be because they were never important or because the importance of a feature may have changed over time.

Not addressing root causes of quality metrics like bias.
Production monitoring tools also tend to not be able to perform root cause analysis on non-performance or quality metrics such as bias.

Not enabling multi-model comparison.
When ML teams are trying to understand the differences in performance between multiple models during model selection or A/B testing, correlation-based RCA cannot provide the necessary insights.



Full Lifecycle Observability, continued

Advanced root cause analysis is key to rapid and effective debugging. Unlike correlation analysis-based analytics, advanced root cause analysis:

Identifies real contributors to overfitting, bias, and performance, concept, or score drift,
using highly accurate explainability analytics, so that teams don't have to guess at root causes or work on irrelevant features.

Supports segment as well as overall root cause analysis,
providing insights that enable ML teams to improve models for important business segments in ways that would not otherwise be possible.

Addresses a full spectrum of metrics,
including both performance and quality metrics.

Makes model performance comparison easy.
Precise analytics make it possible to quickly understand which model is providing better performance across multiple analytical dimensions.

Advanced root cause analysis of model errors, drift, bias, overfitting, and comparative model performance is critical to understanding how to change training data or model hyperparameters and retrain models with a "directed vs. blind" approach.
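One way to move from "which features drifted" to "which features drove the change" is to compare explanation values, rather than raw inputs, between a baseline window and a comparison window. The sketch below illustrates that idea with the open-source shap library on a tree model; it is a simplified stand-in, not TruEra's root cause analysis, and the window variables in the usage comment are hypothetical.

```python
import numpy as np
import pandas as pd
import shap

def feature_contributions_to_score_drift(model, X_baseline, X_current):
    """Rank features by how much their attributions shifted between windows.

    The change in each feature's mean SHAP value approximates its contribution
    to the change in the model's average score between the two windows.
    """
    explainer = shap.TreeExplainer(model)
    shap_baseline = explainer.shap_values(X_baseline)
    shap_current = explainer.shap_values(X_current)
    delta = np.asarray(shap_current).mean(axis=0) - np.asarray(shap_baseline).mean(axis=0)
    # Largest absolute shifts first
    return pd.Series(delta, index=X_baseline.columns).sort_values(key=np.abs, ascending=False)

# Hypothetical usage with a trained binary tree model and two scoring windows:
# model = GradientBoostingClassifier().fit(X_train, y_train)   # from sklearn.ensemble
# print(feature_contributions_to_score_drift(model, X_last_month, X_this_month).head())
```

Features whose attributions barely moved can then be deprioritized even if their raw distributions drifted, which is exactly the failure mode of correlation-only analysis described above.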



Full Lifecycle Observability, continued

Comprehensive Testing, Evaluation, Explainability, and Reporting

Once a model is trained, it needs to be tested and evaluated before being promoted to production. In traditional software development, teams will perform significant testing, including validation and regression testing. ML teams need the tools to perform similar testing on AI applications, but they lack these today. Instead, AI testing today tends to be manual, ad-hoc, and performed by individual data scientists with limited tracking, whereas traditional software testing is automated, systematic, tracked, and visible.

AI needs to be tested more like traditional software
In order to effectively manage AI application performance, ML teams need observability software that can test models to validate that the desired improvement has been realized and to ensure that the new model does not experience a regression in overall or segment performance. To be able to validate a model improvement, observability software also needs to support deep model comparison and explainability capabilities to show how retraining has modified the behavior of problematic features.

The importance of segment and A/B testing
A testing system can only be effective in preventing regressions if it can test against a wide range of performance (e.g., accuracy/error metrics) and quality (e.g., bias or explainability metrics) factors, not only for the overall model but also, critically, for key segments, such as business, feature value range, and model error segments. Otherwise, problems not highlighted by the RCA, or unintended new problems created by the retraining process, might be missed. Observability software also needs to support A/B testing of models by monitoring the performance of these multiple models in production; performing deep model comparisons and root cause analysis to understand the drivers of performance differences between the tested models; and running periodic comparative tests between the two models.
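A segment-level regression gate of the kind described here can be expressed as a simple promotion check. The sketch below is a generic illustration; the segment definitions, synthetic data, and 2% tolerance are assumptions made for the example.

```python
import numpy as np

def segment_accuracy(y_true, y_pred, segment_mask):
    mask = np.asarray(segment_mask, dtype=bool)
    return float((np.asarray(y_true)[mask] == np.asarray(y_pred)[mask]).mean())

def passes_segment_regression_test(y_true, preds_old, preds_new, segments, tolerance=0.02):
    """Block promotion if the candidate model regresses on any key segment.

    `segments` maps a segment name to a boolean mask over the evaluation set,
    e.g. {"top_20pct_customers": mask, "new_products": mask}.
    """
    failures = {}
    for name, mask in segments.items():
        old_acc = segment_accuracy(y_true, preds_old, mask)
        new_acc = segment_accuracy(y_true, preds_new, mask)
        if new_acc < old_acc - tolerance:
            failures[name] = {"old": round(old_acc, 3), "new": round(new_acc, 3)}
    return len(failures) == 0, failures

# Illustrative usage with synthetic evaluation data
rng = np.random.default_rng(7)
y_true = rng.integers(0, 2, 1_000)
preds_old = np.where(rng.random(1_000) < 0.85, y_true, 1 - y_true)   # ~85% accurate
preds_new = np.where(rng.random(1_000) < 0.92, y_true, 1 - y_true)   # better overall...
top_customers = np.arange(1_000) < 200
preds_new[top_customers] = np.where(rng.random(200) < 0.70,          # ...but worse on a key segment
                                    y_true[top_customers], 1 - y_true[top_customers])

ok, failures = passes_segment_regression_test(
    y_true, preds_old, preds_new,
    segments={"overall": np.ones(1_000, dtype=bool), "top_20pct_customers": top_customers},
)
print("promote:", ok, failures)
```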



Full Lifecycle Observability, continued

Why is regression testing necessary? An example.

A retailer using a ML ranking model for search provides an example of the need for regression testing. The retailer had seen an overall improvement in accuracy metrics, but when they looked at the performance of the model across their key business segments, they noticed that the new model underperformed the prior model on their most profitable segments. When 20% of your customers produce 80% of your profits, a model that performs well on the less profitable 80% but underperforms on the profitable 20% is not the right model to promote to production. Similarly, ML teams would not want to promote a model that performs better overall but is unacceptably biased against protected classes.

Production-only observability systems lack these testing, evaluation, and model comparison capabilities. As a result, not only do they fail at root cause analysis, but they can't help ML teams validate improvements or prevent regressions. The lack of these full lifecycle capabilities significantly slows down model feedback cycles and productive model improvement iterations, which are the key to building high quality AI applications.


Full Lifecycle Observability, continued

"Societal metrics" can be as important as performance metrics, especially in environments involving social norms or regulation

AI applications also need to be tested and assessed against relevant norms, regulations, and/or laws. For example, a ML model used for credit decisioning has to be able to demonstrate that it is not biased against protected classes. Models used in medical devices must demonstrate that they are safe and follow "good machine learning practices" as defined by the Food and Drug Administration (FDA).

Figure 3: Full Lifecycle AI Observability and the ML Lifecycle. Across the Machine Learning (ML) lifecycle - Development (Train, Eval/Test), Deployment (Release), and Production Observability (Monitor, Analyze, RCA/Retrain) - full lifecycle AI Observability provides all of the critical pieces of monitoring, debugging, and evaluation/testing to quickly return models to high performance levels.



Full Lifecycle Observability, continued

Periodic reporting is often critical in identifying and initiating debugging

Most production-only observability systems have been modeled on application performance monitoring systems, which track infrastructure performance metrics such as latency and uptime, and then send alerts to DevOps teams. While the fundamentals of this approach are valuable, when applied to the unique situation of AI systems, it often leads to inefficiencies and alert fatigue. By constantly tracking so many variables across the typical large ML model, teams can get bombarded with a high volume of inconsequential alerts. For many kinds of ML models, such as forecasting, churn prediction, marketing, or search/recommendations, performance can be volatile or seasonal. Performance can also decline or increase for reasons external to a model for a short period of time, or due to some black swan event that needs to be removed from the calculations.

For example, a recommendations model on an ecommerce site might naturally underperform during weekends for a B2B business when most corporate procurement staff do not work. Or your churn prediction might perform badly over a short period if your company experiences an unexpected outage, but this period might be legitimately removed from an analysis of the model's performance given the unexpected nature of the event.

Continuous reporting is not the best solution for all situations.


The right timescale is key: continuous vs. periodic analysis

For these use cases, continuous monitoring and threshold-based alerting are not effective. Volatile or seasonal performance changes will produce large numbers of useless alerts that create alert fatigue. To be effective, observability systems need to calculate metrics on the right timescale and then compare that timescale to another relevant period of time (e.g., week-over-week, month-over-month, or year-over-year) to identify whether there have been significant changes in these metrics (e.g., +/- 10%), flagging different levels of changes (e.g., green, yellow, or red).

With their DevOps heritage, production-only observability systems often cannot adequately serve this need, focusing on continuous monitoring and threshold-based alerts. To enable periodic reporting, observability systems need more sophisticated production testing capabilities. If you are only doing continuous monitoring, your observability system might be taking your team on repeated wild rides, leading them to eventually turn down alerts to more manageable levels, but at the cost of missed failure scenarios.

Directed Retraining: made possible by Advanced RCA, segment analytics, and model comparison

In order to take any action to optimize the performance and quality of production AI systems, observability software needs to be able to:

1. Identify time periods with changes in score, accuracy, or bias, and precisely calculate drift.
2. Identify segments that have contributed most to the change in the metric.
3. Calculate the contribution of each feature to the change in the metric.
4. Support analyses needed during development, not just production. For example, observability software should automatically identify high error segments that are contributing to lower model accuracy (higher error) in development. Once the root cause of this overfitting or bias is identified, it can be addressed in development, as well as used in production to more precisely monitor models.
5. Conduct multi-model analysis. Compare metrics and perform root cause analysis across multiple models.

Advanced root cause analysis, segment analysis, and model comparison enable the adoption of directed retraining methodologies. While the optimal retraining method depends on the use case, directed retraining is particularly helpful in the many cases where simple automated retraining will not suffice (e.g., when there are issues with concept drift, bias, or data quality). It is also helpful when ML teams perform retraining on-demand: the observability system identifies an issue, and retraining is determined by the root cause analysis. In these cases, directed retraining provides a faster, more sophisticated resolution.


Key Full Lifecycle Capabilities for AI Quality

Full lifecycle observability also has the advantage of providing a broader set of metrics for evaluating model performance. Production-only observability systems tend to focus only on a narrow band of ML model performance metrics instead of also evaluating the broader quality of AI systems. This limitation is becoming even more pronounced as more AI systems use Generative AI technology. The performance of Generative AI cannot be assessed with traditional accuracy metrics. For example, "hallucinations," or the confident provision of false information, are not captured by traditional metrics.

Case example: Saving weeks of effort with full lifecycle observability

A credit card company saw a sudden drop in credit approvals recommended by its ML-powered credit decisioning system. The model was highly complex, with hundreds of features influencing the outcome. The team tried using in-house, open-source tools to identify and address the problem, but the simple observability and explainability solution was proving ineffective and frustrating. Multiple times, the team spent time chasing down "causes" that turned out to be dead ends.

The team then used TruEra, a full-lifecycle AI observability solution, to try to identify the source of the problem. The solution quickly identified a surprising, unexpected feature that was causing the problem - consumer flight history data. This feature would otherwise have been overlooked, or would have taken weeks to identify. Once this feature was properly addressed, the model returned to high performance.

"It would have taken six weeks to find what TruEra identified in a day."
- Data Scientist, Major Credit Card Company


Key Full Lifecycle Capabilities for AI Quality, continued

To manage AI system quality, ML teams need to be able to monitor, debug, and test models using a broader set of metrics that includes:

Model performance
For example, traditional accuracy or error metrics such as AUC or RMSE.

Data quality
Including measures of missing/null values or data anomalies.

Operational factors
Including explainability measures like reason codes, and risk measures.

Societal factors
Including bias, auditability/documentation, safety, and robustness.

Business KPIs
These tend to be use case specific but can include revenue, cost, and profit metrics.

User KPIs
Including measurements of user engagement, conversion, explicit user responses, and so on.

Segment metrics
The ability to measure the above metrics at a segment level in addition to the overall level.

Particular to LLMs:

Relevance
A measure of the quality of generative content for the use case, e.g., the relevance of the answer to a question.

Human Feedback
Measures of the quality of generative content based on assessments by human users or evaluators.

Sentiment
A measure of the sentiment or toxicity of generative content.

Each AI use case will require a unique set of performance and quality metrics across these categories to monitor, debug, and test AI applications. A team producing a generative AI use case may select very different metrics than a team producing a credit decisioning model that is evaluated for both accuracy and its ability to meet regulatory requirements for fairness.
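The LLM-specific metrics above are typically implemented as feedback functions that score each prompt/response pair. The sketch below is purely illustrative: the keyword heuristics are crude stand-ins for the LLM-based, embedding-based, or human evaluation a production system would use, and all names are hypothetical.

```python
import re
from dataclasses import dataclass

def tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

@dataclass
class LLMRecord:
    prompt: str
    response: str

def relevance_score(record: LLMRecord) -> float:
    """Crude stand-in for an LLM- or embedding-based relevance judge:
    fraction of prompt terms that also appear in the response."""
    prompt_terms = tokens(record.prompt)
    return len(prompt_terms & tokens(record.response)) / max(len(prompt_terms), 1)

NEGATIVE_WORDS = {"terrible", "hate", "awful", "useless"}   # toy lexicon

def sentiment_score(record: LLMRecord) -> float:
    """Crude stand-in for a sentiment/toxicity model: 1.0 = no negative terms."""
    response_terms = tokens(record.response)
    return 1.0 - len(response_terms & NEGATIVE_WORDS) / max(len(response_terms), 1)

# Illustrative usage on two prompt/response pairs
records = [
    LLMRecord("What is the return policy for shoes?",
              "Shoes can be returned within 30 days with the original receipt."),
    LLMRecord("What is the return policy for shoes?",
              "Our new hats are awful weather protection."),
]
for r in records:
    print(f"relevance={relevance_score(r):.2f}  sentiment={sentiment_score(r):.2f}")
```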


Advantages of Full Lifecycle Observability for AI Quality

Implementing full lifecycle observability yields better results than relying on production-only model monitoring and automated retraining alone. The benefits of full lifecycle observability include:

Faster iteration cycle time, enabling teams to improve and retrain models more frequently within a fixed period of time, i.e., an increased number of model iteration releases per year
Improved performance and risk management of ML systems throughout their lifecycle: development and production
Increased efficiency and effectiveness for data science, ML engineering, and MLOps teams, due to faster identification of relevant issues and less wasted time identifying the root cause of those issues
Better cross-team collaboration, due to rapid issue identification and analytical transparency
Reduced operational and cloud costs
Reduced risk and an ability to turn Responsible AI commitments into a source of competitive advantage

Overall, full lifecycle observability software can make AI development more systematic. It can reduce guesswork, alert fatigue, and costly and failed debugging exercises. A more systematic approach can increase the trust of stakeholders to move forward with an AI application. It can also, in turn, lead to faster iterative development. The more iterative development cycles a ML team can execute, the higher the performance and robustness of the ML system, and the more likely it is that the organization can turn their AI systems not only into a source of improved revenue and cost but also a source of sustainable competitive advantage.

© 2023 All Rights Reserved TruEra www.truera.com
