WHITEPAPER

How to Manage AI Performance:
Full Lifecycle AI Observability vs. Production-Only Observability
Table of Contents

Executive Summary
New Solutions are Required for Managing AI Application Performance
Challenges in Managing AI Application Performance
Methods for Managing AI Application Performance
Full Lifecycle Observability
Key Full Lifecycle Capabilities for AI Quality
Advantages of Full Lifecycle Observability for AI Quality
© 2023 All Rights Reserved TruEra www.truera.com
Executive Summary

As AI applications become increasingly important to business or organizational success, achieving and maintaining high model performance is paramount. Monitoring, debugging, and testing models in a systematic, comprehensive way is essential.

There are three options for managing ML system performance:
1. KPI and infrastructure monitoring, combined with model retraining
2. Production-only observability, and
3. Full lifecycle AI observability.

Production-only ML observability is helpful, but not enough

Production-only ML observability software provides advantages over business KPI- and infrastructure-only monitoring by tracking model input, output, and performance metrics. However, tracking model metrics is not enough. ML teams need to be able to identify the root causes of model failures, direct the retraining of models, and test models to validate that the retrained model has addressed the model failure and not created new model failure regressions.

Full lifecycle AI observability is the most effective option
New Solutions are Required for Managing AI Application Performance

Artificial Intelligence and Machine Learning (AI/ML) systems are increasingly important to achieving an enterprise's business Key Performance Indicators. Managing the performance and quality of these systems, however, is challenging. ML models are complex, hard to explain, and difficult to evaluate. They can fail in a number of different ways: due to model performance, KPI performance, data quality, operational factors, and societal factors. ML model performance is susceptible to drift, particularly concept and data drift.
Challenges in Managing AI Application Performance
In order to ensure that ML applications meet their objectives, ML teams
need to be aware of the potential ways in which these applications can
fail. An ML application can fail at multiple layers, and these failures will be
captured by different metrics. These failure scenarios include:
1. Model failures, typically best measured by model performance metrics
2. KPI performance failures, typically best measured by business KPI metrics
3. Data quality issues
4. Failures due to operational factors
5. Failures in societal factors
1. Model Failures

The most important determinant of an ML system's success is the ML model at its heart. Most traditional ML models that predict a data label are trained to achieve certain model accuracy or error metrics such as Area Under the Curve (AUC), precision, recall, Root Mean Square Error (RMSE), and more. ML models are famously susceptible to declines in these metrics after they are pushed to production. This is because ML models are initially trained on a static set of data and labels from a certain period in time; however, the dynamics determining the predicted label in the real world are constantly changing. Specifically, ML models are known to be susceptible to data and concept drift.

Data Drift occurs when the statistical properties of the live input data in production differ from the training data used to develop the model, often leading to inaccurate predictions. Data drift can occur as soon as a model goes into production and is sometimes referred to as "training serving skew." An example of this skew would be a model developed to forecast grocery product demand. If it is trained mostly on data from the east coast but then used to predict demand across the country, it could experience data drift issues immediately.

Data drift also occurs over time. For example, recently developed credit decisioning models would have been trained during a period of low interest rates. If a model was not adequately trained on high interest loan data, its performance would have undoubtedly declined as the Federal Reserve rapidly increased interest rates during 2023.
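Data drift like the interest-rate example is commonly quantified with the Population Stability Index (PSI), which compares binned training and production distributions of a single feature. The sketch below is illustrative, not a prescription from this paper: the `psi` helper, the bin count, and the common rule of thumb that PSI above roughly 0.25 signals a major shift are conventional assumptions.

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a training (expected) and a
    production (actual) sample of one numeric feature."""
    # Bin edges taken from the training distribution's quantiles
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    exp_pct = np.histogram(expected, edges)[0] / len(expected)
    act_pct = np.histogram(actual, edges)[0] / len(actual)
    # Floor the proportions to avoid log(0) and division by zero
    exp_pct = np.clip(exp_pct, 1e-6, None)
    act_pct = np.clip(act_pct, 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

rng = np.random.default_rng(0)
train_rates = rng.normal(3.0, 0.5, 10_000)  # interest rates seen in training
live_rates = rng.normal(5.5, 0.5, 10_000)   # rates after rapid hikes
print(psi(train_rates, train_rates))  # ~0: distributions match
print(psi(train_rates, live_rates))   # large: investigate drift
```

Running such a check per feature on each production batch gives an early drift signal well before labeled outcomes arrive.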
Concept Drift occurs when the relationship between input features and the ML target changes. This phenomenon typically happens over time and often leads to lower performance. In recent years, the Covid-19 pandemic, supply chain disruptions, inflationary pressures, and the Ukraine war have provided a plethora of examples of concept drift.

For example, models used to predict the probability of auto accidents have experienced significant drift due to changes in driving patterns during the Covid-19 pandemic emergency. Prior to the pandemic emergency, features related to driving time and traffic patterns would have been most important to predicting accident rates. During the Covid lockdowns, these models would have predicted massive declines in accidents, due to the sharp decline in traffic volume. This did not occur, however. While accidents did decline somewhat, they did not decline nearly as much as expected. Instead of being influenced by driving time and traffic, accidents were caused by an increase in unsafe highway driving. In addition to the change in the relationship between driving time features and the probability of accidents, there was a change in the severity of these accidents, as the high speed accidents were more severe than accidents prior to the pandemic.

Types of drift

| Type of Drift | Description | When it occurs | Impact on Performance | Example |
| --- | --- | --- | --- | --- |
| Data drift | Change in production input data relative to training data | Over time | Decline or improve* | Increase in interest rates in a credit model |
| Training serving skew | Change in production input data relative to training data | At launch | Decline** | A grocery demand model trained on east coast data, used nationwide |
| Concept drift | Change in the relationship between the input features and the model target | Over time | Decline** | Relationship between driving time and number of accidents during the pandemic lockdowns |

* Data drift will typically produce a change in scores (score drift). Often this reduces model performance, but there are situations when data drift can improve model performance. For example, take a credit model whose training data contains more low interest loans than high. If interest rates were to decline after a period when they were higher than the training data average, this would likely improve model performance.
** Training serving skew and concept drift can increase or decrease model scores, but they typically produce negative performance (e.g., accuracy or error) drift.
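Concept drift of the kind described above usually surfaces as a decline in measured performance once production labels arrive. One way to sketch this is to compute error over consecutive windows of scored traffic and raise an alarm when a window exceeds the training-time baseline; the function names and the 25% tolerance below are illustrative assumptions, not a method prescribed by this paper.

```python
import numpy as np

def rolling_error(y_true, y_pred, window):
    """Mean absolute error over consecutive windows of production traffic."""
    errs = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    n = len(errs) // window
    return [float(errs[i * window:(i + 1) * window].mean()) for i in range(n)]

def drift_alarms(window_errors, baseline, tolerance=0.25):
    """Flag windows whose error exceeds the training baseline by >25%."""
    return [e > baseline * (1 + tolerance) for e in window_errors]

# Toy example: predictions stay flat while the real relationship shifts
y_pred = np.full(300, 10.0)
y_true = np.concatenate([np.full(150, 10.5), np.full(150, 14.0)])
errors = rolling_error(y_true, y_pred, window=50)
print(drift_alarms(errors, baseline=0.5))  # later windows alarm
```

The alarm only tells a team *that* performance declined; identifying *why* is the root cause analysis problem discussed later in this paper.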
2. KPI Performance Failures

ML models can also fail to meet KPIs due to factors outside of the model itself. For example, consider ML ranking models for search. The ranking model may continue to perform by providing relevant results that users want to click on, but business KPIs such as conversion rate or revenue per search can still decline. This can occur for a variety of reasons. The marketing team, for example, could reduce their paid search budget, decreasing higher-converting site visitors, which in turn reduces the overall conversion rate of search sessions. Even though the search engine is performing at the same level (i.e., it is correctly identifying products that the user wishes to see), the KPIs of the site's search system, such as conversion rate and revenue per search, can still decline.

3. Data Quality Issues

The quality of production input data can significantly impact model and business KPI performance. Low quality data, such as null, unexpected, or out-of-bounds values, can result in reduced model accuracy. This is especially true for models dependent on data that can be difficult to consistently process (such as IoT data) or third-party data. For example, a retail product forecast model would significantly underperform if a third party providing product-related data, such as popularity or pricing information, starts sending incorrect data for certain products.
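The null and out-of-bounds checks described above can be sketched as a simple validation pass over an incoming feed. The column names, bounds, and `data_quality_report` helper below are hypothetical, chosen to mirror the retail pricing example.

```python
import pandas as pd

# Hypothetical schema for a retail demand-forecast data feed
CHECKS = {
    "price":      {"min": 0.01, "max": 10_000},
    "popularity": {"min": 0.0,  "max": 1.0},
}

def data_quality_report(df: pd.DataFrame) -> dict:
    """Count nulls and out-of-bounds values per monitored column."""
    report = {}
    for col, bounds in CHECKS.items():
        s = df[col]
        report[col] = {
            "null": int(s.isna().sum()),
            "out_of_bounds": int(((s < bounds["min"]) | (s > bounds["max"])).sum()),
        }
    return report

feed = pd.DataFrame({
    "price":      [19.99, None, -4.0, 25.00],
    "popularity": [0.7, 0.2, 1.8, None],
})
print(data_quality_report(feed))
```

A feed whose report shows any non-zero count can be quarantined or flagged before it reaches the scoring pipeline.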
4. Failures Due to Operational Factors

ML models are often not the only component of an ML application system. The success of an ML application can also be dependent on the people and processes related to the application, not just the software and the ML model. For example, the business performance of an ML system may also be dependent on a human taking a manual action based on a model prediction. If a model is not trusted or explainable, then the human may ignore the model's prediction or not take the best next action based on the prediction. For example, consider a predictive maintenance model that predicts the likelihood of a machine failure. If the ML model does not provide the reasons for the machine failure prediction, such as the machine characteristics driving the prediction (e.g., vibration or temperature outside of acceptable bounds), then the engineer will not necessarily trust the prediction or know which potential component to replace.

5. Failures Due to Societal Factors

ML applications can also fail to achieve goals if they do not adhere to societal norms or regulations, such as those related to fairness or safety. Much has been written about the risk of model bias in employment, credit, and advertising decision-making. If a model achieves high predictive accuracy but is proven to be biased, it cannot be regarded as a success. Recent legislation has increased the need for testing, monitoring, and reporting on ML model performance for societal factors. A New York City AI law regulating the use of automated employment decision tools by employers went into effect in 2023. The "Protecting Consumers from Unfair Discrimination in Insurance Practices" law in Colorado holds insurers accountable for ensuring that their AI systems are not unfairly discriminating against consumers. Regulators, such as the OCC, CFPB, FTC, and FDA, have adopted guidelines governing AI use. New laws, such as the EU's Artificial Intelligence Act, and similar laws in Canada, Brazil, and the UK, are also under consideration. All of these laws essentially require AI developers to regularly explain and audit the quality (e.g., fairness, safety, and robustness) of their AI systems.
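One widely used fairness check of the kind these laws motivate is the disparate impact ratio: the rate of favorable outcomes for a protected group divided by the rate for a reference group, often compared against the EEOC's "four-fifths" guideline of 0.8. The toy sketch below uses invented group labels and data purely for illustration; this paper does not prescribe a specific metric.

```python
def disparate_impact(decisions, groups, protected, reference):
    """Ratio of favorable-outcome rates: protected group vs. reference group."""
    def rate(g):
        sel = [d for d, grp in zip(decisions, groups) if grp == g]
        return sum(sel) / len(sel)
    return rate(protected) / rate(reference)

# Toy credit approvals (1 = approved), hypothetical groups "a" and "b"
decisions = [1, 0, 1, 1, 0, 0, 1, 1, 0, 0]
groups    = ["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"]
ratio = disparate_impact(decisions, groups, protected="b", reference="a")
print(round(ratio, 2))  # 0.67 -- below the common four-fifths threshold
```

Monitoring this ratio per segment over time, rather than once at launch, is what turns a one-off audit into ongoing compliance evidence.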
Observability Challenges When Only Monitoring KPIs

The problem with only measuring KPIs is that it creates an AI Observability Gap.

[Figure 1: The performance monitoring stack. Monitoring business KPIs alone provides limited performance visibility; issues not surfaced in KPIs are missed.]

This gap leads to lower visibility around model performance and the causes of model performance issues. Since KPIs can be influenced by many factors other than the ML model, lack of model performance visibility can make debugging KPI performance and attributing it to elements of ML models exceedingly difficult without additional ways of monitoring. How does the ML team identify when a KPI decline is due to the model or another non-model element of the application?

For example, an ecommerce recommendation model may appear to be driving significantly fewer conversions versus prior periods. However, like the search example we discussed earlier, this period of lower performance could be driven more by the marketing team significantly reducing their paid search marketing spend, which normally brings in higher converting traffic, than by any change in ML model performance. When there is model performance observability, situations like this can be easily identified.

Without AI observability, understanding the drivers of KPI changes is also challenging, because KPI changes often lag. ML models can start to underperform, and it may be some time before this is materially reflected in, for example, a revenue KPI. By then, however, serious damage could already be done. Without AI observability, the time lag makes it hard to determine whether the root cause of the KPI change is the model or other factors occurring at the same time the KPI change is finally observed.
Operational factors
KPI monitoring and retraining also do not address other key failure
scenarios, including operational challenges such as model
explainability or robustness.
Bias situations
Bias issues are often difficult to address with simple retraining alone. If the bias is caused by data drift, retraining could simply reinforce bias rather than reduce it.
Monitoring and Observability

Due to the unique nature of ML models, production-only observability is ineffective for managing the performance and quality of AI applications. It falls short in two significant ways.

1. While production-only observability solutions can help identify KPI, model performance, and data quality failure scenarios, they provide limited capabilities to actually fix these problems. These solutions do not enable teams to effectively and efficiently perform Root Cause Analysis (RCA) of sub-optimal AI performance and quality. Without an understanding of root causes, ML teams cannot effectively direct model training to address the failure scenarios. Production-only observability also cannot test models to validate model quality improvements and prevent quality regressions. In other words, production-only observability enables AI monitoring but doesn't support the other key stages within the AI development lifecycle: root cause analysis, retraining, and testing/evaluation.

Production-only observability provides monitoring capabilities, but lacks critical capabilities for debugging, retraining, evaluation, and testing.
2. Evaluating AI quality. Production-only observability solutions also tend to ignore operational and societal failure scenarios. They focus on monitoring model performance metrics such as accuracy (e.g., AUC, precision, recall) or error (e.g., MAPE or RMSE) metrics. Production-only solutions often provide inaccurate or overly general explainability, which makes it difficult to identify sources of failure and to debug them. They also often lack model/segment analytics, bias monitoring, root-cause analysis, risk management, and auditing capabilities to address operational and societal requirements.
Full Lifecycle Observability: Easy, Effective ML Performance Management

Full lifecycle observability offers complete monitoring, debugging, and testing across the entire model lifecycle. It provides two key capabilities that are lacking in production-only observability systems: advanced root cause analysis and comprehensive testing, evaluation, explainability, and reporting. These capabilities make it faster and easier to identify and address emerging model issues, saving critical time and resources while ensuring that business KPIs are met.
Evaluation, Explainability, and Reporting

Once a model is trained, it needs to be tested and evaluated before being promoted to production. In traditional software development, teams will perform significant testing, including validation and regression testing. ML teams need the tools to perform similar testing on AI applications, but they lack these today. Instead, AI testing today tends to be manual, ad-hoc, and performed by individual data scientists with limited tracking, whereas traditional software testing is automated, systematic, tracked, and visible.

In order to effectively manage AI application performance, ML teams need observability software that can test models to validate that the desired improvement has been realized and to ensure that the new model does not experience a regression in overall or segment performance. To be able to validate a model improvement, observability software also needs to support deep model comparison and explainability capabilities to show how retraining has modified the behavior of problematic features.

The importance of segment and A/B testing

A testing system can only be effective in preventing regressions if it can test against a wide range of performance (e.g., accuracy/error metrics) and quality (e.g., bias or explainability metrics) factors, not only for the overall model but also, critically, for key segments, such as business, feature value range, and model error segments. Otherwise, problems not highlighted by the RCA, or unintended new problems created by the retraining process, might be missed. Observability software also needs to support A/B testing of models by monitoring the performance of these multiple models in production; performing deep model comparisons and root cause analysis to understand the drivers of performance differences between the tested models; and running periodic comparative tests between the two models.
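The segment-level regression testing described above can be sketched as comparing a candidate model against the incumbent both overall and per segment, and blocking promotion on any material loss. The column names, the `segment_accuracy` helper, and the 1% tolerance are illustrative assumptions, not details from this paper.

```python
import pandas as pd

def segment_accuracy(df, model_col):
    """Accuracy overall and per segment for one model's predictions."""
    acc = lambda d: (d[model_col] == d["label"]).mean()
    return {"overall": acc(df), **{s: acc(g) for s, g in df.groupby("segment")}}

def regression_check(df, champion="champ_pred", candidate="cand_pred", tol=0.01):
    """Pass only the slices where the candidate loses no more than `tol`
    accuracy versus the champion; any False blocks promotion."""
    champ = segment_accuracy(df, champion)
    cand = segment_accuracy(df, candidate)
    return {k: bool(cand[k] - champ[k] >= -tol) for k in champ}

df = pd.DataFrame({
    "segment":    ["low", "low", "high", "high"],
    "label":      [1, 0, 1, 0],
    "champ_pred": [1, 0, 1, 0],
    "cand_pred":  [1, 0, 0, 0],
})
print(regression_check(df))  # candidate regresses on the "high" segment
```

Even though the candidate could win on aggregate metrics in other scenarios, a per-segment gate like this catches regressions that overall numbers would hide.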
"Societal metrics" can be as important as performance metrics, especially when social norms or regulated environments are involved

AI applications also need to be tested and assessed against relevant norms, regulations, and/or laws. For example, an ML model used for credit decisioning has to be able to demonstrate that it is not biased against protected classes. Models used in medical devices must demonstrate that they are safe and follow "good machine learning best practices" as defined by the Food and Drug Administration (FDA).
Full lifecycle AI Observability provides all of the critical pieces of monitoring, debugging, and evaluation/testing to quickly return models to high performance levels.
Periodic reporting is often critical in identifying and initiating debugging

Most production-only observability systems have been modeled on application performance monitoring systems, which track infrastructure performance metrics such as latency and uptime, and then send alerts to DevOps teams. While the fundamentals of this approach are valuable, when applied to the unique situation of AI systems, it often leads to inefficiencies and alert fatigue. By constantly tracking so many variables across the typical large ML model, teams can get bombarded with a high volume of inconsequential alerts. For many kinds of ML models, such as forecasting, churn prediction, marketing, or search/recommendations, performance can be volatile or seasonal. Performance can also decline or increase for a short period of time for reasons external to a model, or due to some black swan event that needs to be removed from the calculations. Continuous reporting is not the best solution for all situations.

For example, a recommendations model on an ecommerce site might naturally underperform during weekends for a B2B business, when most corporate procurement staff do not work. Or a churn prediction model might perform badly over a short period if the company experiences an unexpected outage, but this period might be legitimately removed from an analysis of the model's performance given the unexpected nature of the event.
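One lightweight way to act on this observation is to suppress threshold alerts during known low-signal windows, such as weekends for the B2B example above, and rely on periodic reports for those periods instead. The `should_alert` helper and the weekend rule below are illustrative assumptions, not a mechanism described by this paper.

```python
from datetime import datetime

def should_alert(metric, threshold, ts, quiet=lambda t: t.weekday() >= 5):
    """Alert only when the metric breaches its threshold outside a
    known low-signal window (here: weekends, when B2B traffic drops)."""
    return metric < threshold and not quiet(ts)

# Same sub-threshold conversion rate, different days of the week
print(should_alert(0.8, 1.0, datetime(2023, 6, 10)))  # Saturday: suppressed
print(should_alert(0.8, 1.0, datetime(2023, 6, 12)))  # Monday: alert
```

The suppressed windows still appear in the periodic report, so a genuine weekend failure is reviewed, just not paged on.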
Key Full Lifecycle Capabilities for AI Quality, continued

Full lifecycle observability lets teams monitor, debug, and test models using a broader set of metrics that include:

Model performance: for example, traditional accuracy metrics such as RMSE or AUC.

Data quality: including measures of missing/null values or data anomalies.

Operational factors: including explainability measures like reason codes and risk measures.

Business KPIs: these tend to be use case specific but can include revenue, cost, and profit metrics.

User KPIs: including measurements for user engagement, conversion, explicit user responses, and so on.

Particular to LLMs:

Relevance: a measure of the quality of generative content for the use case, e.g., the relevance of the answer to a question.

Human Feedback

Sentiment

This also includes the ability to measure the above metrics at a segment level in addition to the overall level.

Each AI use case will require a unique set of performance and quality metrics across these categories to monitor, debug, and test AI applications. A team producing a generative AI use case may select very different metrics than a team producing a credit decisioning model that is evaluated for both accuracy and its ability to meet regulatory requirements for fairness.
Advantages of Full Lifecycle Observability for AI Quality

Implementing full lifecycle observability yields better results than relying only on production-only model monitoring and automated retraining. The benefits of full lifecycle observability include:

- Faster iteration cycle time, enabling teams to improve and retrain models more frequently within a fixed period of time, i.e., an increased number of model iteration releases per year
- Improved performance and risk management of ML systems throughout their lifecycle: dev and production
- Increased efficiency and effectiveness for Data Science, ML engineering, and MLOps teams, due to faster identification of relevant issues and less wasted time identifying the root cause of those issues
- Better cross-team collaboration, due to rapid issue identification and analytical transparency
- Reduced operational and cloud costs
- Reduced risk and an ability to turn Responsible AI commitments into a source of competitive advantage

Overall, full lifecycle observability software can make AI development more systematic. It can reduce guesswork, alert fatigue, and costly and failed debugging exercises. A more systematic approach can increase the trust of stakeholders to move forward with an AI application. It can also, in turn, lead to faster iterative development. The more iterative development cycles an ML team can execute, the higher the performance and robustness of the ML system, and the more likely it is that the organization can turn its AI systems not only into a source of improved revenue and cost but also a source of sustainable competitive advantage.