NVIDIA TENSORRT
Ashish Sardana | Deep Learning Solutions Architect
AGENDA
Deep Learning in Production
- Current Approaches
- Deployment Challenges
NVIDIA TensorRT
- Programmable Inference Accelerator
- Performance, Optimizations and Features
Example
- Import, Optimize and Deploy TensorFlow Models with TensorRT
Q&A
DEEP LEARNING IN PRODUCTION
- Speech Recognition
- Recommender Systems
- Autonomous Driving
- Real-time Object Recognition
- Robotics
- Real-time Language Translation
- Many More…
CURRENT DEPLOYMENT WORKFLOW
TRAINING: Data Management → Training → Model Assessment → Trained Neural Network
UNOPTIMIZED DEPLOYMENT, three common approaches:
1. Deploy the training framework itself
2. Deploy a custom application using the NVIDIA DL SDK
3. Framework or custom CPU-only application
Challenge: Power and Memory Efficiency, hurt by inefficient applications
➢ Impact: increased cost (running and cooling); makes deployment infeasible
NVIDIA TENSORRT: PROGRAMMABLE INFERENCE ACCELERATOR
TensorRT Optimizer and Runtime
Deployment platforms: Tesla P4, Tesla V100, Jetson TX2, DRIVE PX 2, NVIDIA DLA
developer.nvidia.com/tensorrt
TENSORRT PERFORMANCE
[Chart: 40x faster CNNs on V100 vs. CPU-only, under 7ms latency (ResNet50). Throughput in images/sec and latency in ms for CPU-Only, V100 + TensorFlow, and V100 + TensorRT; CPU-only reaches about 140 images/sec, V100 + TensorRT about 5,700 images/sec.]
[Chart: 140x faster language translation RNNs on V100 vs. CPU-only inference (OpenNMT). Throughput in sentences/sec and latency in ms for CPU-Only + Torch, V100 + Torch, and V100 + TensorRT; CPU-only reaches about 4 sentences/sec at 280 ms latency, V100 + TensorRT about 550 sentences/sec.]
ResNet50 benchmark: inference throughput (images/sec) on ResNet50. V100 + TensorRT: NVIDIA TensorRT (FP16), batch size 39, Tesla V100-SXM2-16GB, E5-2690 v4 @ 2.6GHz, 3.5GHz Turbo (Broadwell), HT on. V100 + TensorFlow: preview of Volta-optimized TensorFlow (FP16), batch size 2, Tesla V100-PCIE-16GB, same CPU. CPU-only: Intel Xeon-D 1587 Broadwell-E CPU and Intel DL SDK; score doubled to reflect Intel's stated claim of a 2x performance improvement on Skylake with AVX512.
OpenNMT benchmark: inference throughput (sentences/sec) on OpenNMT 692M. V100 + TensorRT: NVIDIA TensorRT (FP32), batch size 64, Tesla V100-PCIE-16GB, E5-2690 v4 @ 2.6GHz, 3.5GHz Turbo (Broadwell), HT on. V100 + Torch: Torch (FP32), batch size 4, Tesla V100-PCIE-16GB, same CPU. CPU-only: Torch (FP32), batch size 1, Intel E5-2690 v4 @ 2.6GHz, 3.5GHz Turbo (Broadwell), HT on.
TENSORRT DEPLOYMENT WORKFLOW
Step 1: Optimize trained model
Trained Neural Network → Import Model → TensorRT Optimizer → Serialize Engine → Optimized Plans (Plan 1, Plan 2, Plan 3)
Step 2: Deploy optimized plans with runtime
Plan → TensorRT runtime inference via the C++ or Python API
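Below is a minimal sketch of the build side of this workflow, assuming the TensorRT 5.x Python API and a UFF-format model; the file names, tensor names, and shapes are illustrative placeholders, not part of the original deck.

```python
# Hedged sketch: import a model, let the TensorRT optimizer build an
# engine, and serialize the optimized plan to disk (TensorRT 5.x API).
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

with trt.Builder(TRT_LOGGER) as builder, \
     builder.create_network() as network, \
     trt.UffParser() as parser:
    # Import Model: register I/O tensors, then parse the trained network
    parser.register_input("input", (3, 224, 224))   # CHW input shape (assumed)
    parser.register_output("prob")                  # output tensor name (assumed)
    parser.parse("trained_model.uff", network)

    builder.max_batch_size = 8
    builder.max_workspace_size = 1 << 30            # scratch space for kernel auto-tuning
    builder.fp16_mode = True                        # reduced precision, if the GPU supports it

    # TensorRT Optimizer: fusion, precision selection, and kernel
    # auto-tuning all happen inside this call
    with builder.build_cuda_engine(network) as engine:
        # Serialize Engine: write the optimized plan for the runtime to load
        with open("model.plan", "wb") as f:
            f.write(engine.serialize())
```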
TENSORRT LAYERS
Supported layers:
• Convolution
• LSTM and GRU
• Activation: ReLU, tanh, sigmoid
• Pooling: max and average
• Scaling
• Element-wise operations
• LRN
• Fully-connected
• SoftMax
• Deconvolution
Custom layers plug into the runtime through the Custom Layer API.
[Diagram: Deployed Application → TensorRT Runtime (with Custom Layer) → CUDA Runtime]
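These layer types can also be assembled by hand through TensorRT's network definition API. The sketch below, assuming the TensorRT 5.x Python API, wires up a tiny convolution → ReLU → pooling graph; the tensor names, shapes, and zero-filled weights are dummies for illustration only.

```python
# Hedged sketch: defining supported layers directly with the TensorRT
# network definition API (TensorRT 5.x Python); weights are dummies.
import numpy as np
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network()

inp = network.add_input("input", trt.float32, (3, 224, 224))
kernel = np.zeros((64, 3, 7, 7), dtype=np.float32)   # dummy conv weights
bias = np.zeros(64, dtype=np.float32)                # dummy conv bias
conv = network.add_convolution(inp, 64, (7, 7), kernel, bias)
conv.stride = (2, 2)
relu = network.add_activation(conv.get_output(0), trt.ActivationType.RELU)
pool = network.add_pooling(relu.get_output(0), trt.PoolingType.MAX, (2, 2))
network.mark_output(pool.get_output(0))
# Engine build then proceeds exactly as in the earlier sketch.
```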
TENSORRT OPTIMIZATIONS
• Layer & Tensor Fusion
• FP16 and INT8 Precision Calibration
• Kernel Auto-Tuning
• Dynamic Tensor Memory
LAYER & TENSOR FUSION
[Diagram: unoptimized network graph, with separate convolution, bias, activation, and concat nodes branching from the input]
[Diagram: the same network after layer & tensor fusion, with convolution, bias, and activation collapsed into single kernels and the concat layer eliminated]
FP16, INT8 PRECISION CALIBRATION

Precision | Dynamic Range             | Calibration
FP32      | -3.4x10^38 ~ +3.4x10^38   | Training precision
FP16      | -65504 ~ +65504           | No calibration required; Tensor Core
INT8      | -128 ~ +127               | Requires calibration

[Chart: Reduced Precision Inference Performance (ResNet50), in images/second]
INT8 vs. FP32 Top-1 accuracy after calibration:

Network    | FP32 Top-1 | INT8 Top-1 | Difference
GoogLeNet  | 68.87%     | 68.49%     | 0.38%
VGG        | 68.56%     | 68.45%     | 0.11%
ResNet-50  | 73.11%     | 72.54%     | 0.57%
ResNet-152 | 75.18%     | 74.56%     | 0.61%
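INT8's narrow dynamic range is why calibration is required: TensorRT runs sample data through the network to pick per-tensor scaling factors. The following is a hedged sketch of supplying that data through the TensorRT 5.x Python API; the batch source, batch size, and file names are assumptions for illustration.

```python
# Hedged sketch: an INT8 entropy calibrator feeding sample batches to
# TensorRT (5.x Python API, with PyCUDA managing device memory).
import numpy as np
import pycuda.autoinit          # creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, batches, cache_file="calibration.cache"):
        super().__init__()
        self.batches = iter(batches)     # list of NCHW float32 arrays (assumed)
        self.cache_file = cache_file
        # Device buffer sized for one calibration batch
        self.d_input = cuda.mem_alloc(batches[0].nbytes)

    def get_batch_size(self):
        return 8                         # must match the batch shape (assumed)

    def get_batch(self, names):
        try:
            batch = next(self.batches)
        except StopIteration:
            return None                  # no more data: calibration is done
        cuda.memcpy_htod(self.d_input, np.ascontiguousarray(batch))
        return [int(self.d_input)]       # device pointer(s) for the input(s)

    def read_calibration_cache(self):
        try:
            with open(self.cache_file, "rb") as f:
                return f.read()          # reuse scales from a previous run
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)

# At build time: builder.int8_mode = True
#                builder.int8_calibrator = EntropyCalibrator(batches)
```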
KERNEL AUTO-TUNING, DYNAMIC TENSOR MEMORY
[Diagram: Trained Neural Network → TensorRT Optimizer → Optimized Plans]
IMPORT, OPTIMIZE AND DEPLOY TENSORFLOW MODELS WITH TENSORRT
Steps:
• Start with a frozen TensorFlow model
• Create a model parser
• Optimize the model and create a runtime engine (see the sketch after this list)
• Perform inference using the optimized runtime engine
[Diagram: Trained Neural Network → TensorRT Optimizer → Runtime Engine; New Data → Inference Results]
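A minimal sketch of the first two steps, assuming the `uff` converter package that shipped alongside TensorRT 5.x; the graph file name and output node name are placeholders.

```python
# Hedged sketch: convert a frozen TensorFlow graph to UFF so that
# trt.UffParser() can import it (TensorRT 5.x tooling).
import uff

# Step 1: start from a frozen TensorFlow model (.pb with weights baked in).
# Step 2: the resulting .uff file is what the TensorRT model parser consumes.
uff.from_tensorflow_frozen_model(
    "frozen_graph.pb",              # frozen TensorFlow model (placeholder name)
    output_nodes=["prob"],          # graph output node(s) (assumed)
    output_filename="model.uff",    # file to hand to trt.UffParser()
)
```

Optimization and inference then follow the earlier build sketch and the runtime sketch in the recap below.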
7 STEPS TO DEPLOYMENT WITH TENSORRT
Step 1: Convert trained model into TensorRT format
Step 2: Create a model parser
RECAP: DEPLOYMENT WORKFLOW
Step 1: Optimize trained model
Import Model → TensorRT Optimizer → Serialize Engine → Plan file (e.g. keras_vgg19_b1_fp32.engine)
Step 2: Deploy the plan with the TensorRT Runtime Engine → Prediction Results
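A hedged sketch of the runtime half, loading a serialized plan such as the one named above and running inference through the TensorRT 5.x Python API; the buffer shapes and output size are assumptions.

```python
# Hedged sketch: deserialize an optimized plan and run inference
# (TensorRT 5.x Python API, with PyCUDA for device buffers).
import numpy as np
import pycuda.autoinit          # creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

with open("keras_vgg19_b1_fp32.engine", "rb") as f, \
     trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

with engine.create_execution_context() as context:
    h_input = np.random.random((3, 224, 224)).astype(np.float32)  # dummy image
    h_output = np.empty(1000, dtype=np.float32)                   # assumed 1000 classes
    d_input = cuda.mem_alloc(h_input.nbytes)
    d_output = cuda.mem_alloc(h_output.nbytes)

    cuda.memcpy_htod(d_input, h_input)                 # copy input to the GPU
    context.execute(1, [int(d_input), int(d_output)])  # batch size 1
    cuda.memcpy_dtoh(h_output, d_output)               # fetch prediction results
    print("top-1 class:", h_output.argmax())
```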
CHALLENGES ADDRESSED BY TENSORRT

Requirement: High Throughput
TensorRT delivers: maximizes inference performance on NVIDIA GPUs
➢ INT8 and FP16 precision calibration, layer & tensor fusion, kernel auto-tuning

Requirement: Low Response Time
➢ Up to 40x faster than CPU-only inference and 18x faster inference of TensorFlow models
➢ Under 7ms real-time latency

Requirement: Power and Memory Efficiency
TensorRT delivers: performs target-specific optimizations
➢ Platform-specific kernels for embedded (Jetson), datacenter (Tesla GPUs) and automotive (DRIVE PX)
➢ Dynamic tensor memory management improves memory re-use

Requirement: Deployment-Grade Solution
TensorRT delivers: designed for production environments
➢ No framework overhead, minimal dependencies
➢ Multiple frameworks, Network Definition API
➢ C++ and Python APIs, Custom Layer API
TENSORRT PRODUCTION USE CASES
“NVIDIA’s AI platform, using TensorRT software on Tesla GPUs, is the best technology on the
market to support SAP’s requirements for inferencing. TensorRT and NVIDIA GPUs changed
our business model from an offline, next-day service to real-time. We have maximum AI
performance and versatility to meet our customers’ needs, while substantially reducing
energy requirements.”
Source: JUERGEN MUELLER, SAP Chief Innovation Officer
“Real-time execution is very important for self-driving cars. Developing state of the art
perception algorithms normally requires a painful trade-off between speed and accuracy,
but TensorRT brought our ResNet-151 inference time down from 250ms to 89ms.”
Source: Drew Gray – Director of Engineering, UBER ATG
“TensorRT is a real game changer. Not only does TensorRT make model deployment a snap
but the resulting speed up is incredible: out of the box, BodySLAM™, our human pose
estimation engine, now runs over two times faster than using CAFFE GPU inferencing.”
Source: Paul Kruszewski, CEO - WRNCH
TENSORRT KEY TAKEAWAYS
NVIDIA TENSORRT 5 RC NOW AVAILABLE
Volta TensorCore ⚫ TensorFlow Importer ⚫ Python API
• 3.7x faster inference on Tesla V100 vs. Tesla P100 under 7ms real-time latency
• Optimize and deploy TensorFlow models up to 18x faster vs. the TensorFlow framework
• Improved productivity with an easy-to-use Python API for data science workflows
[Diagram: Data Scientists → Compiled & Optimized Model]