12 Model Deployment & Serving



SCIENCE PASSION TECHNOLOGY

Architecture of ML Systems
12 Model Deployment & Serving
Matthias Boehm
Graz University of Technology, Austria
Computer Science and Biomedical Engineering
Institute of Interactive Systems and Data Science
BMK endowed chair for Data Management

Last update: June 16, 2021


Announcements/Org

• #1 Video Recording
  • Link in TeachCenter & TUbe (lectures will be public)
  • https://tugraz.webex.com/meet/m.boehm
  • Corona traffic light: RED → May 17: ORANGE → Jul 01: YELLOW

• #2 Programming Projects / Exercises
  • Soft deadline: June 30 (w/ room for extension)
  • Submission of exercises in TeachCenter
  • Submission of projects as PRs in Apache SystemDS

• #3 Exams
  • Doodle w/ 42/~50 exam slots (45min each)
  • July 7/8/9/12/13 (done via skype/webex)

• #4 Course Evaluation
  • Please participate; open period: June 1 – July 15
Data Science Lifecycle

Recap: The Data Science Lifecycle


[Figure: Data-centric view of the lifecycle — application, workload, and data-system perspectives. Stages: Data Integration / Data Cleaning / Data Preparation → Model Selection / Training / Hyper-parameters → Validate & Debug / Deployment / Scoring & Feedback. An exploratory process (experimentation, refinements, ML pipelines), involving the data scientist, data/SW engineer, and DevOps engineer.]

Agenda

• Model Exchange and Serving


• Model Monitoring and Updates


Model Exchange and Serving


Model Exchange Formats



• Definition: Deployed Model
  • #1 Trained ML model (weight/parameter matrix)
  • #2 Trained weights AND operator graph / entire ML pipeline,
    especially for DNNs (many weight/bias tensors, hyper-parameters, etc.)

• Recap: Data Exchange Formats (model + metadata)
  • General-purpose formats: CSV, JSON, XML, Protobuf
  • Sparse matrix formats: Matrix Market, LibSVM
  • Scientific formats: NetCDF, HDF5
  • ML-system-specific binary formats (e.g., SystemDS, PyTorch serialized)

• Problem: ML System Landscape
  • Different languages and frameworks, including versions
  • Lack of standardization → DSLs for ML are the wild west


Model Exchange Formats, cont.



• Why Open Standards? [Nick Pentreath: Open Standards for Machine Learning Deployment, bbuzz 2019]
  • Open source allows inspection but no control
  • Open governance is necessary for an open standard
  • Cons: needs adoption, moves slowly

• #1 Predictive Model Markup Language (PMML)
  • Model exchange format in XML, created by the Data Mining Group (1997)
  • Packages model weights, hyper-parameters, and a limited set of algorithms
    (a scoring sketch follows below)

• #2 Portable Format for Analytics (PFA)
  • Attempt to fix the limitations of PMML, created by the Data Mining Group
  • JSON and Avro exchange format
  • Minimal functional math language → arbitrary custom models
  • Scoring in JVM, Python, R
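A minimal scoring sketch in Python, assuming the third-party pypmml package and an illustrative file model.pmml with features f1/f2 (both are assumptions, not part of the lecture material):

import pypmml

# load the XML-based model exchanged between systems
model = pypmml.Model.fromFile("model.pmml")

# score a single record; field names must match the PMML data dictionary
print(model.predict({"f1": 1.0, "f2": 2.5}))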


Model Exchange Formats, cont.



• #3 Open Neural Network Exchange (ONNX)
  • Model exchange format (data and operator graph) via Protobuf
  • First Facebook and Microsoft, then IBM, Amazon → PyTorch, MXNet
  • Focused on deep learning and tensor operations
  • ONNX-ML: support for traditional ML algorithms
  • Scoring engine: https://github.com/Microsoft/onnxruntime (usage sketch below)
  • Cons: low level (e.g., fused ops), DNN-centric → ONNX-ML
  • ONNX importer in SystemDS (Lukas Timpl): python/systemds/onnx_systemds
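A minimal sketch of scoring an exchanged model with the onnxruntime engine; the file model.onnx and its single float input are illustrative assumptions:

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx")
input_name = sess.get_inputs()[0].name          # e.g., "X"
X = np.random.rand(1, 4).astype(np.float32)     # one request row
outputs = sess.run(None, {input_name: X})       # None -> fetch all outputs
print(outputs[0])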

• TensorFlow Saved Models
  • TensorFlow-specific exchange format for model and operator graph
    (export/load sketch below)
  • Freezes input weights and literals for additional optimizations
    (e.g., constant folding, quantization, etc.)
  • Cloud providers may not be interested in open exchange standards
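A minimal export/load sketch, assuming an illustrative Keras model and path (built Keras models get a serving_default signature when saved):

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(2, input_shape=(4,))])
tf.saved_model.save(model, "/tmp/my_model")     # serialize graph + weights

loaded = tf.saved_model.load("/tmp/my_model")
infer = loaded.signatures["serving_default"]    # inference entry point
print(infer(tf.constant([[1., 2., 3., 4.]])))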


ML Systems for Serving



• #1 Embedded ML Serving
  • TensorFlow Lite and new language bindings (small footprint,
    dedicated HW acceleration, APIs, and models: MobileNet, SqueezeNet);
    a scoring sketch is shown below
  • SystemML JMLC (Java ML Connector)
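A minimal embedded-scoring sketch with the TF Lite interpreter; the model file mobilenet.tflite and the zero input are illustrative assumptions:

import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="mobilenet.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

x = np.zeros(inp["shape"], dtype=inp["dtype"])  # placeholder input tensor
interpreter.set_tensor(inp["index"], x)
interpreter.invoke()                            # in-process, small footprint
print(interpreter.get_tensor(out["index"]))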

• #2 ML Serving Services
  • Motivation: complex DNN models, run on dedicated HW
    (example: Google Translate, 140B words/day, 82K GPUs in 2016)
  • RPC/REST interface for applications
  • TensorFlow Serving: configurable serving w/ batching
  • Clipper: decoupled multi-framework scoring, w/ batching and result caching
  • Pretzel: batching and multi-model optimizations in ML.NET
  • Rafiki: optimization for accuracy under latency constraints, plus batching and multi-model optimizations
[Christopher Olston et al: TensorFlow-Serving: Flexible, High-Performance ML Serving. NIPS ML Systems 2017]
[Daniel Crankshaw et al: Clipper: A Low-Latency Online Prediction Serving System. NSDI 2017]
[Yunseong Lee et al.: PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems. OSDI 2018]
[Wei Wang et al: Rafiki: Machine Learning as an Analytics Service System. PVLDB 2018]

Serverless Computing
[Joseph M. Hellerstein et al: Serverless Computing: One Step Forward, Two Steps Back. CIDR 2019]
• Definition: Serverless
  • FaaS: functions-as-a-service (event-driven, stateless input-output mapping)
  • Infrastructure for deployment and auto-scaling of APIs/functions
  • Examples: Amazon Lambda, Microsoft Azure Functions, etc.

[Figure: event sources (e.g., cloud services) trigger Lambda functions, which call other APIs and services; auto-scaling; pay-per-request (1M requests × 100ms ≈ $0.20)]

• Example
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

public class MyHandler implements RequestHandler<Tuple, MyResponse> {
  @Override
  public MyResponse handleRequest(Tuple input, Context context) {
    return expensiveModelScoring(input); // with read-only model
  }
}

Example: SystemDS JMLC

• Example Scenario: Token Classification
[Figure: sentences ΔX flow through feature extraction (e.g., doc structure, sentences, tokenization, n-grams) into sentence classification (e.g., joins ⨝, aggregations), which applies the deployed "model" M]

• Challenges
  • Scoring as part of a larger end-to-end pipeline → embedded scoring
  • External parallelization w/o materialization
  • Simple synchronous scoring → latency ⇒ throughput
  • Data size (tiny ΔX, huge model M) → minimize overhead per ΔX
  • Seamless integration & model consistency
  • Token inputs & outputs

Example: SystemDS JMLC, cont.

• Background: Frames
  • Abstract data type with schema (boolean, int, double, string)
  • Column-wise block layout
  • Local/distributed operations: e.g., indexing, append, transform
  • Distributed representation: ? x ncol(F) blocks
    (shuffle-free conversion of csv / datasets)

• Data Preparation via Transform
[Figure: training — FX, FY → transformencode → X, Y plus metadata MX, MY → model B; scoring — ΔFX → transformapply → ΔX, and ΔŶ → transformdecode → ΔFŶ]


Example: SystemDS JMLC, cont.

• Motivation
  • Embedded scoring
  • Latency ⇒ throughput
  • Minimize overhead per ΔX

  Typical compiler/runtime overheads:
  • Script parsing and config: ~100ms
  • Validation, compile, IPA: ~10ms
  • HOP DAG (re-)compile: ~1ms
  • Instruction execute: <0.1μs
• Example
// single-node, no evictions, no recompile, no multithreading
Connection conn = new Connection();
PreparedScript pscript = conn.prepareScript(
  getScriptAsString("glm-predict-extended.dml"),
  new String[]{"FX","MX","MY","B"}, new String[]{"FY"});

// setup static/constant inputs (for reuse)
pscript.setFrame("MX", MX, true);
pscript.setFrame("MY", MY, true);
pscript.setMatrix("B", B, true);

// execute the precompiled script many times
for( Document d : documents ) {
  FrameBlock FX = ...; // input pipeline
  pscript.setFrame("FX", FX);
  FrameBlock FY = pscript.executeScript().getFrame("FY");
  // ... remaining pipeline
}
Serving Optimizations – Batching

• Recap: Model Batching (see 08 Data Access)
  • One-pass evaluation of multiple configurations:
    read X once in O(m*n), compute in O(m*n*k), with m >> n >> k
  • EL, CV, feature selection, hyper-parameter tuning
  • E.g.: TUPAQ [SoCC'16], Columbus [SIGMOD'14]
• Data Batching [Clipper @ NSDI'17]
  • Batching to utilize the HW more efficiently under SLAs
  • Use case: multiple users use the same model
    (wait, collect user requests X1, X2, X3, and merge them into one batch X)
  • Adaptive: additive increase, multiplicative decrease
    (a minimal sketch of this policy follows below)
  • Benefits for multi-class / complex models
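A minimal sketch of the adaptive policy in Python; the SLA value, step sizes, and function names are illustrative assumptions, not Clipper's API:

import time

SLA_MS, STEP, BACKOFF = 10.0, 4, 0.5
batch_size = 8

def score_batch(batch):
    time.sleep(0.001 * len(batch))  # placeholder for batched model scoring

def serve(requests):
    global batch_size
    while requests:
        batch, requests = requests[:batch_size], requests[batch_size:]
        start = time.time()
        score_batch(batch)
        latency_ms = (time.time() - start) * 1000
        if latency_ms <= SLA_MS:
            batch_size += STEP                              # additive increase
        else:
            batch_size = max(1, int(batch_size * BACKOFF))  # multiplicative decrease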

Serving Optimizations – Quantization



• Quantization (see 08 Data Access)
  • Lossy compression via ultra-low precision / fixed-point methods
  • Ex.: 62.7% of energy spent on data movement [Amirali Boroumand et al.: Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks. ASPLOS 2018]

• Quantization for Model Scoring
  • Usually much smaller data types (e.g., UINT8)
  • Quantization of model weights, and sometimes also activations
  • → Reduced memory requirements and better latency / throughput (SIMD)

import tensorflow as tf

# convert a SavedModel to TF Lite w/ post-training quantization
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
tflite_quant_model = converter.convert()

[Credit: https://www.tensorflow.org/lite/performance/post_training_quantization]


Serving Optimizations – MQO



• Result Caching
  • Establish a function cache for X → Y
    (memoization of deterministic function evaluation); a minimal sketch is shown below
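A minimal sketch of such a function cache in Python, keyed by a hash of the input; all names are illustrative:

import hashlib
import numpy as np

_cache = {}

def cached_predict(model, X):
    key = hashlib.sha256(X.tobytes()).hexdigest()  # deterministic input key
    if key not in _cache:                          # miss -> score and remember
        _cache[key] = model.predict(X)
    return _cache[key]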

• Multi-Model Optimizations
  • Same input fed into multiple partially redundant model evaluations
  • Common subexpression elimination between prediction programs
  • Done during compilation or at runtime
  • In PRETZEL, programs are compiled into physical stages and registered
    with the runtime, plus caching for stages (decided based on hashing the inputs)
    [Yunseong Lee et al.: PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems. OSDI 2018]


Serving Optimizations – Compilation (see 04 Adaptation, Fusion, and JIT)
• TensorFlow tf.compile [Chris Leary, Todd Wang: XLA – TensorFlow, Compiled!, TF Dev Summit 2017]
  • Compile entire TF graph into a binary function w/ low footprint
  • Input: graph, config (feeds + fetches w/ fixed shape sizes)
  • Output: x86 binary and C++ header (e.g., for inference)
  • Specialization for frozen model and sizes

• PyTorch Compile [Vincent Quenneville-Bélair: How PyTorch Optimizes Deep Learning Computations, Guest Lecture Stanford 2020]
  • Compile Python functions into ScriptModule/ScriptFunction
  • Lazily collect operations, optimize, and JIT compile
  • Explicit jit.script call or @torch.jit.script

import torch

a = torch.rand(5)

def func(x):
    for i in range(10):
        x = x * x  # unrolled into graph
    return x

jitfunc = torch.jit.script(func)  # JIT
jitfunc.save("func.pt")


Serving Optimizations – Model Vectorization



• HummingBird [https://github.com/microsoft/hummingbird]
  [Supun Nakandala et al: A Tensor Compiler for Unified Machine Learning Prediction Serving. OSDI 2020]
  • Compiles ML scoring pipelines into tensor ops (usage sketch below)
  • Tree-based models (GEMM, 2x tree traversal)

[Figure: input → node → path → prediction; tree paths bucketed as -1 (lhs) / 0 / 1 (rhs), path sums fed through a bucket-class mapping]
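A minimal usage sketch with hummingbird.ml.convert; the scikit-learn model and random data are illustrative:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from hummingbird.ml import convert

X = np.random.rand(1000, 16).astype(np.float32)
y = np.random.randint(2, size=1000)

skl_model = RandomForestClassifier(n_estimators=10).fit(X, y)
hb_model = convert(skl_model, "pytorch")  # tree scoring via tensor ops (GEMM)
print(hb_model.predict(X[:5]))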

• Model Distillation [Geoffrey E. Hinton, Oriol Vinyals, Jeffrey Dean: Distilling the Knowledge in a Neural Network. CoRR 2015]
  • Ensembles of models → single NN model
  • Specialized models for different classes
    (found via differences to the generalist model); a loss sketch follows below
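A minimal sketch of the distillation loss from Hinton et al. (soft targets with temperature T plus hard-label cross-entropy); the hyper-parameter values are illustrative:

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # soft targets: match the softened teacher distribution (scaled by T^2)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    # hard targets: standard cross-entropy on the true labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard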

Serving Optimizations – Specialization



• NoScope Architecture [Daniel Kang et al: NoScope: Optimizing Deep CNN-Based Queries over Video Streams at Scale. PVLDB 2017]
  • Baseline: YOLOv2 on 1 GPU per video camera @ 30fps
  • Optimizer to find filters

• #1 Model Specialization
  • Given query and baseline model
  • Train a shallow NN (based on AlexNet) on the output of the baseline model
  • Short-circuit if prediction has high confidence

• #2 Difference Detection
  • Compute difference to reference image / earlier frame
  • Short-circuit w/ reference label if no significant difference
  • (A cascade sketch combining #1 and #2 follows below)
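A minimal sketch of the cascade in Python; thresholds, models, and frame representation are illustrative assumptions, not NoScope's implementation:

import numpy as np

DIFF_THRESHOLD, CONF_THRESHOLD = 1e-2, 0.9

def score_frame(frame, prev_frame, prev_label, specialized, baseline):
    # #2 difference detection: reuse the label if the frame barely changed
    if prev_frame is not None and np.mean((frame - prev_frame) ** 2) < DIFF_THRESHOLD:
        return prev_label
    # #1 model specialization: accept confident cheap predictions
    label, confidence = specialized(frame)
    if confidence >= CONF_THRESHOLD:
        return label
    return baseline(frame)  # fall back to the full model (e.g., YOLOv2)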

Model Monitoring and Updates


Part of Model Management and MLOps
(see 10 Model Selection & Management)


Model Deployment Workflow



[Figure: deployment workflow — #1 model deployment: data integration / cleaning / preparation, model selection, training, hyper-parameters → deployed model (MX, MY, B); #2 continuous data validation / concept drift detection on incoming prediction requests at model serving; #3 model monitoring; #4 periodic / event-based re-training & updates (automatic / semi-manual), driven by a DevOps engineer]


Monitoring Deployed Models



• Goals: Robustness (e.g., data, latency) and model accuracy
  [Neoklis Polyzotis, Sudip Roy, Steven Whang, Martin Zinkevich: Data Management Challenges in Production Machine Learning. SIGMOD 2017]

• #1 Check Deviations of Training/Serving Data
  • Different data distributions, distinct items → impact on model accuracy?
  • See 09 Data Acquisition and Preparation (Data Validation);
    a minimal distribution check is sketched below

• #2 Definition of Alerts
  • Understandable and actionable
  • Sensitivity for alerts (ignored if too frequent)

• #3 Data Fixes
  • Identify problematic parts
  • Impact of fix on accuracy
  • How to backfill into training data
  • "The question is not whether something is 'wrong'. The question is whether it gets fixed."
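A minimal per-feature deviation check via a Kolmogorov-Smirnov test; the significance level is an illustrative assumption:

import numpy as np
from scipy import stats

def deviates(train_col, serve_col, alpha=0.01):
    statistic, p_value = stats.ks_2samp(train_col, serve_col)
    return p_value < alpha  # True -> distributions likely differ, raise an alert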

Monitoring Deployed Models, cont.



• Alert Guidelines [Neoklis Polyzotis, Sudip Roy, Steven Whang, Martin Zinkevich: Data Management Challenges in Production Machine Learning. SIGMOD 2017]
  • Make them actionable (in decreasing order of actionability:
    missing field, field has new values, distribution changes)
  • Question data AND constraints [George Beskales et al: On the relative trust between inconsistent data and inaccurate constraints. ICDE 2013]
  • Combining repairs: principle of minimality [Xu Chu, Ihab F. Ilyas: Qualitative Data Cleaning. Tutorial, PVLDB 2016]

• Complex Data Lifecycle
  • Adding new features to production ML pipelines is a complex process
  • Data does not live in a DBMS; data often resides in multiple storage systems that have different characteristics
  • Collecting data for training can be hard and expensive


Concept Drift
[A. Bifet, J. Gama, M. Pechenizkiy, I. Žliobaitė: Handling Concept Drift: Importance, Challenges & Solutions. PAKDD 2011]

• Recap: Concept Drift (features → labels)
  • Change of statistical properties / dependencies (features-labels)
  • Requires re-training; parametric approaches for deciding when to retrain

• #1 Input Data Changes
  • Population change (gradual/sudden), but also new categories, data errors
  • Covariate shift: p(x) changes with constant p(y|x)

• #2 Output Data Changes
  • Label shift: p(y) changes
  • Constant conditional feature distribution p(x|y)

• Goals: Fast adaptation; noise vs. change, recurring contexts, small overhead



Concept Drift, cont.
[A. Bifet, J. Gama, M. Pechenizkiy, I. Žliobaitė: Handling Concept Drift: Importance, Challenges & Solutions. PAKDD 2011]

• Approach 1: Periodic Re-Training
  • Training: window of latest data + data selection/weighting
  • Alternatives: incremental maintenance, warm starting, online learning

• Approach 2: Event-based Re-Training
  • Change detection (supervised, unsupervised)
  • Often model-dependent; specific techniques for time series
  • Drift Detection Method (DDM): binomial distribution; if the error moves outside a scaled standard deviation → raise warnings and alerts
  • Adaptive Windowing (ADWIN): maintain window W, append new data to W, and drop old values until the averages of all sub-windows W = W1 · W2 are similar (difference below epsilon); raise alerts
    [Albert Bifet, Ricard Gavaldà: Learning from Time-Changing Data with Adaptive Windowing. SDM 2007]
    [https://scikitmultiflow.readthedocs.io/en/stable/api/generated/skmultiflow.drift_detection.ADWIN.html]
  • Kolmogorov-Smirnov distance / Chi-squared: univariate statistical tests on training/serving data (an ADWIN usage sketch follows below)
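A minimal sketch of ADWIN change detection with scikit-multiflow (see the URL above), streaming a model's per-request error signal; the synthetic drift data is illustrative:

import numpy as np
from skmultiflow.drift_detection import ADWIN

adwin = ADWIN(delta=0.002)
errors = np.concatenate([np.random.binomial(1, 0.05, 1000),   # stable phase
                         np.random.binomial(1, 0.30, 1000)])  # drift phase

for i, e in enumerate(errors):
    adwin.add_element(float(e))
    if adwin.detected_change():
        print(f"drift detected at element {i} -> trigger re-training")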


Concept Drift, cont.



• Model-agnostic Performance Predictor (supports Approach 2: Event-based Re-Training)
  [Sebastian Schelter, Tammo Rukat, Felix Bießmann: Learning to Validate the Predictions of Black Box Classifiers on Unseen Data. SIGMOD 2020]
  • User-defined error generators
  • Synthetic data corruption → impact on the black-box model
  • Train a performance predictor (regression / classification at threshold t)
    for the expected prediction quality on percentiles of the target variable ŷ;
    a sketch of the approach follows below

[Figure: results of the performance prediction model (PPM)]
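A minimal sketch of the idea (corrupt held-out data, record black-box accuracy, fit a regressor from corruption parameters to expected accuracy); the corruptions and features are illustrative assumptions, not the paper's exact error generators:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def corrupt(X, frac_missing, noise_scale, rng):
    Xc = X.copy()
    Xc[rng.random(X.shape) < frac_missing] = 0.0     # simulate missing values
    return Xc + rng.normal(0, noise_scale, X.shape)  # simulate noise

def fit_performance_predictor(black_box, X_val, y_val, n_samples=200, seed=0):
    rng = np.random.default_rng(seed)
    feats, accs = [], []
    for _ in range(n_samples):
        fm, ns = rng.uniform(0, 0.5), rng.uniform(0, 1.0)
        acc = np.mean(black_box.predict(corrupt(X_val, fm, ns, rng)) == y_val)
        feats.append([fm, ns])
        accs.append(acc)
    return RandomForestRegressor().fit(feats, accs)  # performance predictor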


GDPR (General Data Protection Regulation)



• GDPR "Right to be Forgotten" [https://gdpr.eu/article-17-right-to-be-forgotten/]
  • Recent laws such as GDPR require companies and institutions to delete user data upon request
  • Personal data must not only be deleted from primary data stores but also from ML models trained on it (Recital 75)

• Example: Deanonymization [Sebastian Schelter: "Amnesia" - Machine Learning Models That Can Forget User Data Very Fast. CIDR 2020]
  • Recommender systems: models X ≈ UV retain user similarity
  • Social network data / clustering / KNN
  • Large language models (e.g., GPT-3)


GDPR, cont.
[Sebastian Schelter, Stefan Grafberger, Ted Dunning: HedgeCut: Maintaining Randomised Trees for Low-Latency Machine Unlearning. SIGMOD 2021]
• HedgeCut Overview
  • Extremely Randomized Trees (ERT): ensemble of DTs w/ randomized attributes and cut-off points
  • Online unlearning requests < 1ms w/o retraining for few points

• Handling of Non-robust Splits

Summary and Conclusions

• Model Exchange and Serving
• Model Monitoring and Updates

• #1 Finalize Programming Projects by ~June 30

• #2 Oral Exam
  • Doodle for July 7/8/9/12/13, 45min each (done via skype/webex)
  • Part 1: Describe your programming project, warm-up questions
  • Part 2: Questions on 2-3 topics of the 11 lectures
    (basic understanding of the discussed topics / techniques)

