Architecture of ML Systems
12 Model Deployment & Serving
Matthias Boehm
Graz University of Technology, Austria
Computer Science and Biomedical Engineering
Institute of Interactive Systems and Data Science
BMK endowed chair for Data Management
#1 Video Recording
Link in TeachCenter & TUbe (lectures will be public)
https://tugraz.webex.com/meet/m.boehm
Corona traffic light: RED → May 17: ORANGE → Jul 01: YELLOW
#3 Exams
Doodle w/ 42/~50 exam slots (45min each)
July 7/8/9/12/13 (done via skype/webex)
#4 Course Evaluation
Please participate; open period: June 1 – July 15
706.550 Architecture of Machine Learning Systems – 12 Model Deployment & Serving
Matthias Boehm, Graz University of Technology, SS 2021
Data Science Lifecycle
Exploratory Process
(experimentation, refinements, ML pipelines)
Data/SW Engineer, DevOps Engineer
#1 Embedded ML Serving
TensorFlow Lite and new language bindings (small footprint,
dedicated HW acceleration, APIs, and models: MobileNet, SqueezeNet)
SystemML JMLC (Java ML Connector)
Serverless Computing
[Joseph M. Hellerstein et al: Serverless Computing: One Step Forward, Two Steps Back. CIDR 2019]
Definition Serverless
FaaS: functions-as-a-service (event-driven, stateless input-output mapping)
Infrastructure for deployment and auto-scaling of APIs/functions
Examples: AWS Lambda, Microsoft Azure Functions, etc.
Event Source (e.g., cloud services) → Lambda Functions → Other APIs and Services
Auto scaling
Pay-per-request
(1M x 100ms = 0.2$)
Example
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

public class MyHandler implements RequestHandler<Tuple, MyResponse> {
  @Override
  public MyResponse handleRequest(Tuple input, Context context) {
    return expensiveModelScoring(input); // with read-only model
  }
}
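The same stateless input-output mapping can be sketched in Python. This is a minimal illustration, not a definitive implementation: the `handler(event, context)` shape follows the usual Python convention for FaaS handlers, while `MODEL_WEIGHTS`, `score`, and the `"features"` event field are hypothetical names introduced here.

```python
import json

# Hypothetical read-only model: loaded once at module level, outside the
# handler, so that repeated invocations in the same container can reuse it.
MODEL_WEIGHTS = [0.5, -0.25, 1.0]


def score(features):
    """Stand-in for expensive model scoring: a plain dot product."""
    return sum(w * x for w, x in zip(MODEL_WEIGHTS, features))


def handler(event, context):
    """Stateless mapping: one input event in, one response out."""
    features = event["features"]  # assumed event layout
    body = json.dumps({"score": score(features)})
    return {"statusCode": 200, "body": body}
```

Calling `handler({"features": [2.0, 4.0, 1.0]}, None)` locally exercises the function without any cloud infrastructure; the handler keeps no state between calls and only reads the model.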
Model Exchange and Serving
“Model” M
Challenges
Scoring part of larger end-to-end pipeline: embedded scoring
External parallelization w/o materialization
Simple synchronous scoring: latency ⇒ throughput
Data size (tiny ΔX, huge model M): minimize overhead per ΔX
Seamless integration & model consistency
Token inputs & outputs
ΔFX → transformapply → ΔX → Scoring → ΔŶ → transformdecode → ΔFŶ
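The token-in, token-out scoring path can be sketched as three small steps. This is a toy illustration under stated assumptions: the recode maps `RECODE` and `LABELS`, the stand-in `score` model, and the function names are all hypothetical and do not reflect the JMLC API.

```python
# Hypothetical recode maps, as they might be produced during training.
RECODE = {"red": 1, "green": 2, "blue": 3}  # input token -> feature code
LABELS = {0: "reject", 1: "accept"}         # class id -> output token


def transform_apply(tokens):
    """Token inputs -> numeric inputs (0 for tokens unseen in training)."""
    return [RECODE.get(t, 0) for t in tokens]


def score(x):
    """Stand-in model: predicts class 1 if the sum of codes is even."""
    return 1 if sum(x) % 2 == 0 else 0


def transform_decode(y):
    """Numeric prediction -> output token."""
    return LABELS[y]


def serve(tokens):
    """End-to-end path: encode, score, decode."""
    return transform_decode(score(transform_apply(tokens)))
```

Keeping the encode and decode maps next to the model is one way to address model consistency: the serving path then applies exactly the same transformations that were fitted during training.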
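import tensorflow as tf
# saved_model_dir: path to an exported SavedModel directory (defined by the caller)
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
# Optimize.DEFAULT enables post-training quantization; OPTIMIZE_FOR_SIZE is a deprecated alias
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()
[Credit: https://www.tensorflow.org/lite/performance/post_training_quantization ]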
Result Caching
Establish a function cache for X → Y
(memoization of deterministic function evaluation)
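A minimal sketch of such a function cache, using the standard library's `functools.lru_cache`. The names `expensive_score`, `cached_score`, and the `CALLS` counter are hypothetical and exist only to make the cache hits observable; the approach assumes scoring is deterministic and inputs are hashable.

```python
from functools import lru_cache

CALLS = {"n": 0}  # counts real model evaluations, for illustration only


def expensive_score(x):
    """Stand-in for deterministic model scoring."""
    CALLS["n"] += 1
    return sum(v * v for v in x)


@lru_cache(maxsize=1024)
def cached_score(x):
    # x must be hashable (e.g., a tuple) to serve as the cache key.
    return expensive_score(x)
```

Repeated calls such as `cached_score((1.0, 2.0))` evaluate the model once and answer later requests from the cache; `cached_score.cache_info()` reports hits and misses. The bounded `maxsize` trades memory for hit rate.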
PyTorch Compile
Compile Python functions into ScriptModule/ScriptFunction
Lazily collect operations, optimize, and JIT compile
Explicit jit.script call or @torch.jit.script
[Vincent Quenneville-Bélair: How PyTorch Optimizes Deep Learning Computations, Guest Lecture Stanford 2020]
import torch
a = torch.rand(5)
def func(x):
    for i in range(10):
        x = x * x  # unrolled into graph
    return x
jitfunc = torch.jit.script(func)  # JIT
jitfunc.save("func.pt")
path ∑ Bucket-class
Bucket paths: -1 (lhs) / 0 / 1 (rhs) mapping
Model Distillation
Ensembles of models → single NN model
Specialized models for different classes (found via differences to generalist model)
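One common way to distill an ensemble into a single model is to train the student on temperature-softened teacher outputs. The sketch below assumes that technique (it is not stated on the slide); the function names and the temperature value are illustrative, and only the loss computation is shown, not the training loop.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_soft_targets(ensemble_logits, temperature=2.0):
    """Average the softened class distributions of all ensemble members."""
    probs = [softmax(logits, temperature) for logits in ensemble_logits]
    n = len(probs)
    return [sum(p[k] for p in probs) / n for k in range(len(probs[0]))]


def distillation_loss(student_logits, soft_targets, temperature=2.0):
    """Cross-entropy between soft targets and the softened student output."""
    q = softmax(student_logits, temperature)
    return -sum(t * math.log(qk) for t, qk in zip(soft_targets, q))
```

A student whose logits resemble the ensemble's yields a lower loss than one that favors a different class, which is the signal a training loop would minimize.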
NoScope Architecture
Baseline: YOLOv2 on 1 GPU
per video camera @30fps
Optimizer to find filters
[Daniel Kang et al: NoScope: Optimizing Deep CNN-Based Queries over Video Streams at Scale. PVLDB 2017]
#1 Model Specialization
Given query and baseline model
Trained shallow NN (based on AlexNet) on output of baseline model
Short-circuit if prediction with high confidence
#2 Difference Detection
Compute difference to ref-image/earlier-frame
Short-circuit w/ ref label if no significant difference
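The two short-circuits can be combined into a cascade in front of the baseline model. The sketch below is a simplified illustration, assuming frames are flat lists of pixel intensities and that the difference detector runs before the specialized model; the thresholds and all function names are hypothetical, not values chosen by the NoScope optimizer.

```python
def frame_difference(frame, reference):
    """Mean absolute pixel difference between two equally sized frames."""
    return sum(abs(a - b) for a, b in zip(frame, reference)) / len(frame)


def cascade_predict(frame, reference, ref_label, specialized, baseline,
                    diff_threshold=0.05, confidence_threshold=0.9):
    """Return (label, stage) where stage names the filter that answered."""
    # Difference detection: reuse the reference label if nothing changed.
    if frame_difference(frame, reference) < diff_threshold:
        return ref_label, "difference-detector"
    # Specialized model: accept only high-confidence predictions.
    label, confidence = specialized(frame)
    if confidence >= confidence_threshold:
        return label, "specialized"
    # Otherwise fall back to the expensive baseline model.
    return baseline(frame), "baseline"
```

Here `specialized` is any callable returning `(label, confidence)` and `baseline` any callable returning a label. The cheaper a stage and the more frames it answers, the fewer frames reach the baseline model.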
#3 Model Monitoring
#4 Periodic / Event-based Re-Training & Updates (automatic / semi-manual)
DevOps Engineer
Goals: Robustness (e.g., data, latency) and model accuracy
[Neoklis Polyzotis, Sudip Roy, Steven Whang, Martin Zinkevich: Data Management Challenges in Production Machine Learning, SIGMOD 2017]
#2 Definition of Alerts
Understandable and actionable
During serving: 0.11?
Sensitivity for alerts (ignored if too frequent)
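One way to make an alert both actionable and tunable is to compare a serving statistic against its training counterpart and expose the tolerance as a parameter. The sketch below assumes a simple mean-shift check; `drift_score`, `check_feature`, and the default `sensitivity` are hypothetical choices, and the training values are assumed to be non-constant.

```python
from statistics import mean, stdev


def drift_score(training_values, serving_values):
    """Shift of the serving mean in units of the training std deviation."""
    sigma = stdev(training_values)  # assumes non-constant training values
    return abs(mean(serving_values) - mean(training_values)) / sigma


def check_feature(name, training_values, serving_values, sensitivity=3.0):
    """Return an alert message, or None if the shift is within tolerance."""
    score = drift_score(training_values, serving_values)
    if score <= sensitivity:
        return None
    return (f"ALERT feature '{name}': serving mean "
            f"{mean(serving_values):.2f} vs training mean "
            f"{mean(training_values):.2f} ({score:.1f} training std devs)")
```

Raising `sensitivity` produces fewer alerts, which matters because alerts that fire too often tend to be ignored. The message names the feature and both values so that the recipient knows what to inspect.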
Concept Drift
[A. Bifet, J. Gama, M. Pechenizkiy, I. Žliobaitė: Handling Concept Drift: Importance, Challenges & Solutions, PAKDD 2011]
Recap Concept Drift (features → labels)
Results PPM
Example Deanonymization
Recommender systems: models X ≈ UV^T
GDPR, cont.
[Sebastian Schelter, Stefan Grafberger, Ted Dunning: HedgeCut: Maintaining Randomised Trees for Low-Latency Machine Unlearning, SIGMOD 2021]
HedgeCut Overview
Extremely Randomized Trees (ERT):
ensemble of DTs w/ randomized
attributes and cut-off points
Online unlearning requests < 1ms
w/o retraining for few points