12 Model Deployment & Serving



SCIENCE PASSION TECHNOLOGY

Architecture of ML Systems
12 Model Deployment & Serving
Matthias Boehm
Graz University of Technology, Austria
Computer Science and Biomedical Engineering
Institute of Interactive Systems and Data Science
BMK endowed chair for Data Management

Last update: June 16, 2021


Announcements/Org

• #1 Video Recording
  • Link in TeachCenter & TUbe (lectures will be public)
  • https://tugraz.webex.com/meet/m.boehm
  • Corona traffic light: RED → May 17: ORANGE → Jul 01: YELLOW

• #2 Programming Projects / Exercises
  • Soft deadline: June 30 (w/ room for extension)
  • Submission of exercises in TeachCenter
  • Submission of projects as PRs in Apache SystemDS

• #3 Exams
  • Doodle w/ 42/~50 exam slots (45min each)
  • July 7/8/9/12/13 (done via skype/webex)

• #4 Course Evaluation
  • Please participate; open period: June 1 – July 15
Data Science Lifecycle

Recap: The Data Science Lifecycle


[Figure: Data-centric view of the lifecycle — application, workload, and data-system perspectives. Stages: Data Integration / Data Cleaning / Data Preparation → Model Selection / Training / Hyper-parameters → Validate & Debug / Deployment / Scoring & Feedback. An exploratory process (experimentation, refinements, ML pipelines), involving the data scientist, data/SW engineer, and DevOps engineer.]

Agenda

• Model Exchange and Serving


• Model Monitoring and Updates


Model Exchange and Serving


Model Exchange Formats



• Definition: Deployed Model
  • #1 Trained ML model (weight/parameter matrix)
  • #2 Trained weights AND operator graph / entire ML pipeline,
    especially for DNNs (many weight/bias tensors, hyper-parameters, etc.)

• Recap: Data Exchange Formats (model + metadata)
  • General-purpose formats: CSV, JSON, XML, Protobuf
  • Sparse matrix formats: Matrix Market, LibSVM
  • Scientific formats: NetCDF, HDF5
  • ML-system-specific binary formats (e.g., SystemDS, PyTorch serialized)

• Problem: ML System Landscape
  • Different languages and frameworks, including versions
  • Lack of standardization → DSLs for ML are the wild west


Model Exchange Formats, cont.



• Why Open Standards? [Nick Pentreath: Open Standards for Machine Learning Deployment, bbuzz 2019]
  • Open source allows inspection but no control
  • Open governance is necessary for an open standard
  • Cons: needs adoption, moves slowly

• #1 Predictive Model Markup Language (PMML)
  • Model exchange format in XML, created by the Data Mining Group (1997)
  • Packages model weights, hyper-parameters, and a limited set of algorithms
    (a scoring sketch follows below)

• #2 Portable Format for Analytics (PFA)
  • Attempt to fix the limitations of PMML, created by the Data Mining Group
  • JSON and Avro exchange format
  • Minimal functional math language → arbitrary custom models
  • Scoring in JVM, Python, R
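A minimal scoring sketch in Python, assuming the third-party pypmml package and an illustrative file model.pmml with features f1/f2 (both are assumptions, not part of the lecture material):

import pypmml

# load the XML-based model exchanged between systems
model = pypmml.Model.fromFile("model.pmml")

# score a single record; field names must match the PMML data dictionary
print(model.predict({"f1": 1.0, "f2": 2.5}))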


Model Exchange Formats, cont.



• #3 Open Neural Network Exchange (ONNX)
  • Model exchange format (data and operator graph) via Protobuf
  • First Facebook and Microsoft, then IBM, Amazon → PyTorch, MXNet
  • Focused on deep learning and tensor operations
  • ONNX-ML: support for traditional ML algorithms
  • Scoring engine: https://github.com/Microsoft/onnxruntime (usage sketch below)
  • Cons: low level (e.g., fused ops), DNN-centric → ONNX-ML
  • ONNX importer in SystemDS (Lukas Timpl): python/systemds/onnx_systemds
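A minimal sketch of scoring an exchanged model with the onnxruntime engine; the file model.onnx and its single float input are illustrative assumptions:

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx")
input_name = sess.get_inputs()[0].name          # e.g., "X"
X = np.random.rand(1, 4).astype(np.float32)     # one request row
outputs = sess.run(None, {input_name: X})       # None -> fetch all outputs
print(outputs[0])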

• TensorFlow Saved Models
  • TensorFlow-specific exchange format for model and operator graph
    (export/load sketch below)
  • Freezes input weights and literals for additional optimizations
    (e.g., constant folding, quantization, etc.)
  • Cloud providers may not be interested in open exchange standards
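A minimal export/load sketch, assuming an illustrative Keras model and path (built Keras models get a serving_default signature when saved):

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(2, input_shape=(4,))])
tf.saved_model.save(model, "/tmp/my_model")     # serialize graph + weights

loaded = tf.saved_model.load("/tmp/my_model")
infer = loaded.signatures["serving_default"]    # inference entry point
print(infer(tf.constant([[1., 2., 3., 4.]])))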


ML Systems for Serving



• #1 Embedded ML Serving
  • TensorFlow Lite and new language bindings (small footprint,
    dedicated HW acceleration, APIs, and models: MobileNet, SqueezeNet);
    a scoring sketch is shown below
  • SystemML JMLC (Java ML Connector)
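A minimal embedded-scoring sketch with the TF Lite interpreter; the model file mobilenet.tflite and the zero input are illustrative assumptions:

import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="mobilenet.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

x = np.zeros(inp["shape"], dtype=inp["dtype"])  # placeholder input tensor
interpreter.set_tensor(inp["index"], x)
interpreter.invoke()                            # in-process, small footprint
print(interpreter.get_tensor(out["index"]))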

• #2 ML Serving Services
  • Motivation: complex DNN models, run on dedicated HW
    (example: Google Translate, 140B words/day, 82K GPUs in 2016)
  • RPC/REST interface for applications
  • TensorFlow Serving: configurable serving w/ batching
  • Clipper: decoupled multi-framework scoring, w/ batching and result caching
  • Pretzel: batching and multi-model optimizations in ML.NET
  • Rafiki: optimization for accuracy under latency constraints, plus batching and multi-model optimizations
[Christopher Olston et al: TensorFlow-Serving: Flexible, High-Performance ML Serving. NIPS ML Systems 2017]
[Daniel Crankshaw et al: Clipper: A Low-Latency Online Prediction Serving System. NSDI 2017]
[Yunseong Lee et al.: PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems. OSDI 2018]
[Wei Wang et al: Rafiki: Machine Learning as an Analytics Service System. PVLDB 2018]

Serverless Computing
[Joseph M. Hellerstein et al: Serverless Computing: One Step Forward, Two Steps Back. CIDR 2019]
• Definition: Serverless
  • FaaS: functions-as-a-service (event-driven, stateless input-output mapping)
  • Infrastructure for deployment and auto-scaling of APIs/functions
  • Examples: Amazon Lambda, Microsoft Azure Functions, etc.

[Figure: event sources (e.g., cloud services) trigger Lambda functions, which call other APIs and services; auto-scaling; pay-per-request (1M requests × 100ms ≈ $0.20)]

• Example
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

public class MyHandler implements RequestHandler<Tuple, MyResponse> {
  @Override
  public MyResponse handleRequest(Tuple input, Context context) {
    return expensiveModelScoring(input); // with read-only model
  }
}

Example: SystemDS JMLC

• Example Scenario: Token Classification
[Figure: sentences ΔX flow through feature extraction (e.g., doc structure, sentences, tokenization, n-grams) into sentence classification (e.g., joins ⨝, aggregations), which applies the deployed "model" M]

• Challenges
  • Scoring as part of a larger end-to-end pipeline → embedded scoring
  • External parallelization w/o materialization
  • Simple synchronous scoring → latency ⇒ throughput
  • Data size (tiny ΔX, huge model M) → minimize overhead per ΔX
  • Seamless integration & model consistency
  • Token inputs & outputs

Example: SystemDS JMLC, cont.

• Background: Frames
  • Abstract data type with schema (boolean, int, double, string)
  • Column-wise block layout
  • Local/distributed operations: e.g., indexing, append, transform
  • Distributed representation: ? x ncol(F) blocks
    (shuffle-free conversion of csv / datasets)

• Data Preparation via Transform
[Figure: training — FX, FY → transformencode → X, Y plus metadata MX, MY → model B; scoring — ΔFX → transformapply → ΔX, and ΔŶ → transformdecode → ΔFŶ]


Example: SystemDS JMLC, cont.

• Motivation
  • Embedded scoring
  • Latency ⇒ throughput
  • Minimize overhead per ΔX

  Typical compiler/runtime overheads:
  • Script parsing and config: ~100ms
  • Validation, compile, IPA: ~10ms
  • HOP DAG (re-)compile: ~1ms
  • Instruction execute: <0.1μs
• Example
// single-node, no evictions, no recompile, no multithreading
Connection conn = new Connection();
PreparedScript pscript = conn.prepareScript(
  getScriptAsString("glm-predict-extended.dml"),
  new String[]{"FX","MX","MY","B"}, new String[]{"FY"});

// setup static/constant inputs (for reuse)
pscript.setFrame("MX", MX, true);
pscript.setFrame("MY", MY, true);
pscript.setMatrix("B", B, true);

// execute the precompiled script many times
for( Document d : documents ) {
  FrameBlock FX = ...; // input pipeline
  pscript.setFrame("FX", FX);
  FrameBlock FY = pscript.executeScript().getFrame("FY");
  // ... remaining pipeline
}
Serving Optimizations – Batching

• Recap: Model Batching (see 08 Data Access)
  • One-pass evaluation of multiple configurations:
    read X once in O(m*n), compute in O(m*n*k), with m >> n >> k
  • EL, CV, feature selection, hyper-parameter tuning
  • E.g.: TUPAQ [SoCC'16], Columbus [SIGMOD'14]
• Data Batching [Clipper @ NSDI'17]
  • Batching to utilize the HW more efficiently under SLAs
  • Use case: multiple users use the same model
    (wait, collect user requests X1, X2, X3, and merge them into one batch X)
  • Adaptive: additive increase, multiplicative decrease
    (a minimal sketch of this policy follows below)
  • Benefits for multi-class / complex models
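A minimal sketch of the adaptive policy in Python; the SLA value, step sizes, and function names are illustrative assumptions, not Clipper's API:

import time

SLA_MS, STEP, BACKOFF = 10.0, 4, 0.5
batch_size = 8

def score_batch(batch):
    time.sleep(0.001 * len(batch))  # placeholder for batched model scoring

def serve(requests):
    global batch_size
    while requests:
        batch, requests = requests[:batch_size], requests[batch_size:]
        start = time.time()
        score_batch(batch)
        latency_ms = (time.time() - start) * 1000
        if latency_ms <= SLA_MS:
            batch_size += STEP                              # additive increase
        else:
            batch_size = max(1, int(batch_size * BACKOFF))  # multiplicative decrease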

Serving Optimizations – Quantization



• Quantization (see 08 Data Access)
  • Lossy compression via ultra-low precision / fixed-point methods
  • Ex.: 62.7% of energy spent on data movement [Amirali Boroumand et al.: Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks. ASPLOS 2018]

• Quantization for Model Scoring
  • Usually much smaller data types (e.g., UINT8)
  • Quantization of model weights, and sometimes also activations
  • → Reduced memory requirements and better latency / throughput (SIMD)

import tensorflow as tf

# convert a SavedModel to TF Lite w/ post-training quantization
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
tflite_quant_model = converter.convert()

[Credit: https://www.tensorflow.org/lite/performance/post_training_quantization]


Serving Optimizations – MQO



• Result Caching
  • Establish a function cache for X → Y
    (memoization of deterministic function evaluation); a minimal sketch is shown below
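A minimal sketch of such a function cache in Python, keyed by a hash of the input; all names are illustrative:

import hashlib
import numpy as np

_cache = {}

def cached_predict(model, X):
    key = hashlib.sha256(X.tobytes()).hexdigest()  # deterministic input key
    if key not in _cache:                          # miss -> score and remember
        _cache[key] = model.predict(X)
    return _cache[key]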

• Multi-Model Optimizations
  • Same input fed into multiple partially redundant model evaluations
  • Common subexpression elimination between prediction programs
  • Done during compilation or at runtime
  • In PRETZEL, programs are compiled into physical stages and registered
    with the runtime, plus caching for stages (decided based on hashing the inputs)
    [Yunseong Lee et al.: PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems. OSDI 2018]


Serving Optimizations – Compilation (see 04 Adaptation, Fusion, and JIT)
• TensorFlow tf.compile [Chris Leary, Todd Wang: XLA – TensorFlow, Compiled!, TF Dev Summit 2017]
  • Compile entire TF graph into a binary function w/ low footprint
  • Input: graph, config (feeds + fetches w/ fixed shape sizes)
  • Output: x86 binary and C++ header (e.g., for inference)
  • Specialization for frozen model and sizes

• PyTorch Compile [Vincent Quenneville-Bélair: How PyTorch Optimizes Deep Learning Computations, Guest Lecture Stanford 2020]
  • Compile Python functions into ScriptModule/ScriptFunction
  • Lazily collect operations, optimize, and JIT compile
  • Explicit jit.script call or @torch.jit.script

import torch

a = torch.rand(5)

def func(x):
    for i in range(10):
        x = x * x  # unrolled into graph
    return x

jitfunc = torch.jit.script(func)  # JIT
jitfunc.save("func.pt")


Serving Optimizations – Model Vectorization



• HummingBird [https://github.com/microsoft/hummingbird]
  [Supun Nakandala et al: A Tensor Compiler for Unified Machine Learning Prediction Serving. OSDI 2020]
  • Compiles ML scoring pipelines into tensor ops (usage sketch below)
  • Tree-based models (GEMM, 2x tree traversal)

[Figure: input → node → path → prediction; tree paths bucketed as -1 (lhs) / 0 / 1 (rhs), path sums fed through a bucket-class mapping]
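A minimal usage sketch with hummingbird.ml.convert; the scikit-learn model and random data are illustrative:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from hummingbird.ml import convert

X = np.random.rand(1000, 16).astype(np.float32)
y = np.random.randint(2, size=1000)

skl_model = RandomForestClassifier(n_estimators=10).fit(X, y)
hb_model = convert(skl_model, "pytorch")  # tree scoring via tensor ops (GEMM)
print(hb_model.predict(X[:5]))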

• Model Distillation [Geoffrey E. Hinton, Oriol Vinyals, Jeffrey Dean: Distilling the Knowledge in a Neural Network. CoRR 2015]
  • Ensembles of models → single NN model
  • Specialized models for different classes
    (found via differences to the generalist model); a loss sketch follows below
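A minimal sketch of the distillation loss from Hinton et al. (soft targets with temperature T plus hard-label cross-entropy); the hyper-parameter values are illustrative:

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # soft targets: match the softened teacher distribution (scaled by T^2)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    # hard targets: standard cross-entropy on the true labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard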

Serving Optimizations – Specialization



• NoScope Architecture [Daniel Kang et al: NoScope: Optimizing Deep CNN-Based Queries over Video Streams at Scale. PVLDB 2017]
  • Baseline: YOLOv2 on 1 GPU per video camera @ 30fps
  • Optimizer to find filters

• #1 Model Specialization
  • Given query and baseline model
  • Train a shallow NN (based on AlexNet) on the output of the baseline model
  • Short-circuit if prediction has high confidence

• #2 Difference Detection
  • Compute difference to reference image / earlier frame
  • Short-circuit w/ reference label if no significant difference
  • (A cascade sketch combining #1 and #2 follows below)
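A minimal sketch of the cascade in Python; thresholds, models, and frame representation are illustrative assumptions, not NoScope's implementation:

import numpy as np

DIFF_THRESHOLD, CONF_THRESHOLD = 1e-2, 0.9

def score_frame(frame, prev_frame, prev_label, specialized, baseline):
    # #2 difference detection: reuse the label if the frame barely changed
    if prev_frame is not None and np.mean((frame - prev_frame) ** 2) < DIFF_THRESHOLD:
        return prev_label
    # #1 model specialization: accept confident cheap predictions
    label, confidence = specialized(frame)
    if confidence >= CONF_THRESHOLD:
        return label
    return baseline(frame)  # fall back to the full model (e.g., YOLOv2)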

Model Monitoring and Updates


Part of Model Management and MLOps
(see 10 Model Selection & Management)


Model Deployment Workflow



[Figure: deployment workflow — #1 model deployment: data integration / cleaning / preparation, model selection, training, hyper-parameters → deployed model (MX, MY, B); #2 continuous data validation / concept drift detection on incoming prediction requests at model serving; #3 model monitoring; #4 periodic / event-based re-training & updates (automatic / semi-manual), driven by a DevOps engineer]


Monitoring Deployed Models



• Goals: Robustness (e.g., data, latency) and model accuracy
  [Neoklis Polyzotis, Sudip Roy, Steven Whang, Martin Zinkevich: Data Management Challenges in Production Machine Learning. SIGMOD 2017]

• #1 Check Deviations of Training/Serving Data
  • Different data distributions, distinct items → impact on model accuracy?
  • See 09 Data Acquisition and Preparation (Data Validation);
    a minimal distribution check is sketched below

• #2 Definition of Alerts
  • Understandable and actionable
  • Sensitivity for alerts (ignored if too frequent)

• #3 Data Fixes
  • Identify problematic parts
  • Impact of fix on accuracy
  • How to backfill into training data
  • "The question is not whether something is 'wrong'. The question is whether it gets fixed."
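A minimal per-feature deviation check via a Kolmogorov-Smirnov test; the significance level is an illustrative assumption:

import numpy as np
from scipy import stats

def deviates(train_col, serve_col, alpha=0.01):
    statistic, p_value = stats.ks_2samp(train_col, serve_col)
    return p_value < alpha  # True -> distributions likely differ, raise an alert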

Monitoring Deployed Models, cont.



• Alert Guidelines [Neoklis Polyzotis, Sudip Roy, Steven Whang, Martin Zinkevich: Data Management Challenges in Production Machine Learning. SIGMOD 2017]
  • Make them actionable (in decreasing order of actionability:
    missing field, field has new values, distribution changes)
  • Question data AND constraints [George Beskales et al: On the relative trust between inconsistent data and inaccurate constraints. ICDE 2013]
  • Combining repairs: principle of minimality [Xu Chu, Ihab F. Ilyas: Qualitative Data Cleaning. Tutorial, PVLDB 2016]

• Complex Data Lifecycle
  • Adding new features to production ML pipelines is a complex process
  • Data does not live in a DBMS; data often resides in multiple storage systems that have different characteristics
  • Collecting data for training can be hard and expensive


Concept Drift
[A. Bifet, J. Gama, M. Pechenizkiy, I. Žliobaitė: Handling Concept Drift: Importance, Challenges & Solutions. PAKDD 2011]

• Recap: Concept Drift (features → labels)
  • Change of statistical properties / dependencies (features-labels)
  • Requires re-training; parametric approaches for deciding when to retrain

• #1 Input Data Changes
  • Population change (gradual/sudden), but also new categories, data errors
  • Covariate shift: p(x) changes with constant p(y|x)

• #2 Output Data Changes
  • Label shift: p(y) changes
  • Constant conditional feature distribution p(x|y)

• Goals: Fast adaptation; noise vs. change, recurring contexts, small overhead



Concept Drift, cont.
[A. Bifet, J. Gama, M. Pechenizkiy, I. Žliobaitė: Handling Concept Drift: Importance, Challenges & Solutions. PAKDD 2011]

• Approach 1: Periodic Re-Training
  • Training: window of latest data + data selection/weighting
  • Alternatives: incremental maintenance, warm starting, online learning

• Approach 2: Event-based Re-Training
  • Change detection (supervised, unsupervised)
  • Often model-dependent; specific techniques for time series
  • Drift Detection Method (DDM): binomial distribution; if the error moves outside a scaled standard deviation → raise warnings and alerts
  • Adaptive Windowing (ADWIN): maintain window W, append new data to W, and drop old values until the averages of all sub-windows W = W1 · W2 are similar (difference below epsilon); raise alerts
    [Albert Bifet, Ricard Gavaldà: Learning from Time-Changing Data with Adaptive Windowing. SDM 2007]
    [https://scikitmultiflow.readthedocs.io/en/stable/api/generated/skmultiflow.drift_detection.ADWIN.html]
  • Kolmogorov-Smirnov distance / Chi-squared: univariate statistical tests on training/serving data (an ADWIN usage sketch follows below)
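A minimal sketch of ADWIN change detection with scikit-multiflow (see the URL above), streaming a model's per-request error signal; the synthetic drift data is illustrative:

import numpy as np
from skmultiflow.drift_detection import ADWIN

adwin = ADWIN(delta=0.002)
errors = np.concatenate([np.random.binomial(1, 0.05, 1000),   # stable phase
                         np.random.binomial(1, 0.30, 1000)])  # drift phase

for i, e in enumerate(errors):
    adwin.add_element(float(e))
    if adwin.detected_change():
        print(f"drift detected at element {i} -> trigger re-training")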


Concept Drift, cont.



• Model-agnostic Performance Predictor (supports Approach 2: Event-based Re-Training)
  [Sebastian Schelter, Tammo Rukat, Felix Bießmann: Learning to Validate the Predictions of Black Box Classifiers on Unseen Data. SIGMOD 2020]
  • User-defined error generators
  • Synthetic data corruption → impact on the black-box model
  • Train a performance predictor (regression / classification at threshold t)
    for the expected prediction quality on percentiles of the target variable ŷ;
    a sketch of the approach follows below

[Figure: results of the performance prediction model (PPM)]
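A minimal sketch of the idea (corrupt held-out data, record black-box accuracy, fit a regressor from corruption parameters to expected accuracy); the corruptions and features are illustrative assumptions, not the paper's exact error generators:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def corrupt(X, frac_missing, noise_scale, rng):
    Xc = X.copy()
    Xc[rng.random(X.shape) < frac_missing] = 0.0     # simulate missing values
    return Xc + rng.normal(0, noise_scale, X.shape)  # simulate noise

def fit_performance_predictor(black_box, X_val, y_val, n_samples=200, seed=0):
    rng = np.random.default_rng(seed)
    feats, accs = [], []
    for _ in range(n_samples):
        fm, ns = rng.uniform(0, 0.5), rng.uniform(0, 1.0)
        acc = np.mean(black_box.predict(corrupt(X_val, fm, ns, rng)) == y_val)
        feats.append([fm, ns])
        accs.append(acc)
    return RandomForestRegressor().fit(feats, accs)  # performance predictor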


GDPR (General Data Protection Regulation)



• GDPR "Right to be Forgotten" [https://gdpr.eu/article-17-right-to-be-forgotten/]
  • Recent laws such as GDPR require companies and institutions to delete user data upon request
  • Personal data must not only be deleted from primary data stores but also from ML models trained on it (Recital 75)

• Example: Deanonymization [Sebastian Schelter: "Amnesia" - Machine Learning Models That Can Forget User Data Very Fast. CIDR 2020]
  • Recommender systems: models X ≈ UV retain user similarity
  • Social network data / clustering / KNN
  • Large language models (e.g., GPT-3)


GDPR, cont.
[Sebastian Schelter, Stefan Grafberger, Ted Dunning: HedgeCut: Maintaining Randomised Trees for Low-Latency Machine Unlearning. SIGMOD 2021]
• HedgeCut Overview
  • Extremely Randomized Trees (ERT): ensemble of DTs w/ randomized attributes and cut-off points
  • Online unlearning requests < 1ms w/o retraining for few points

• Handling of Non-robust Splits

Summary and Conclusions

• Model Exchange and Serving
• Model Monitoring and Updates

• #1 Finalize Programming Projects by ~June 30

• #2 Oral Exam
  • Doodle for July 7/8/9/12/13, 45min each (done via skype/webex)
  • Part 1: Describe your programming project, warm-up questions
  • Part 2: Questions on 2-3 topics of the 11 lectures
    (basic understanding of the discussed topics / techniques)

