DEEP LEARNING DEPLOYMENT WITH NVIDIA TENSORRT
Ashish Sardana | Deep Learning Solutions Architect
AGENDA

Deep Learning in Production
- Current Approaches
- Deployment Challenges

NVIDIA TensorRT
- Programmable Inference Accelerator
- Performance, Optimizations and Features

Example
- Import, Optimize and Deploy TensorFlow Models with TensorRT

Key Takeaways and Additional Resources

Q&A
DEEP LEARNING IN PRODUCTION

Speech Recognition
Recommender Systems
Autonomous Driving
Real-time Object Recognition
Robotics
Real-time Language Translation
Many More…
CURRENT DEPLOYMENT WORKFLOW

TRAINING: Data Management → Training Data → Training → Model Assessment → Trained Neural Network
(Built on CUDA and the NVIDIA Deep Learning SDK: cuDNN, cuBLAS, NCCL)

UNOPTIMIZED DEPLOYMENT — three common options:
1. Deploy the training framework itself
2. Deploy a custom application using the NVIDIA DL SDK
3. Deploy a framework or custom CPU-only application
CHALLENGES WITH CURRENT APPROACHES

High Throughput
  Challenge: Unable to process high-volume, high-velocity data
  ➢ Impact: Increased cost ($, time) per inference

Low Response Time
  Challenge: Applications don't deliver real-time results
  ➢ Impact: Negatively affects user experience (voice recognition, personalized recommendations, real-time object detection)

Power and Memory Efficiency
  Challenge: Inefficient applications
  ➢ Impact: Increased cost (running and cooling), makes deployment infeasible

Deployment-Grade Solution
  Challenge: Research frameworks not designed for production
  ➢ Impact: Framework overhead and dependencies increase time to solution and affect productivity
NVIDIA TENSORRT
Programmable Inference Accelerator

FRAMEWORKS → TensorRT (Optimizer + Runtime) → GPU PLATFORMS:
Tesla P4, Jetson TX2, Drive PX 2, NVIDIA DLA, Tesla V100

developer.nvidia.com/tensorrt
TENSORRT PERFORMANCE

40x Faster CNNs on V100 vs. CPU-Only, Under 7ms Latency (ResNet50)
[Chart] Throughput: 140 images/sec (CPU-Only), 305 images/sec (V100 + TensorFlow), 5,700 images/sec (V100 + TensorRT); latency: 14 ms, 6.67 ms and 6.83 ms respectively.

140x Faster Language Translation RNNs on V100 vs. CPU-Only Inference (OpenNMT)
[Chart] Throughput: 4 sentences/sec (CPU-Only + Torch), 25 sentences/sec (V100 + Torch), 550 sentences/sec (V100 + TensorRT); latency: 280 ms, 153 ms and 117 ms respectively.

ResNet50 configuration: inference throughput (images/sec). V100 + TensorRT: NVIDIA TensorRT (FP16), batch size 39, Tesla V100-SXM2-16GB. V100 + TensorFlow: preview of Volta-optimized TensorFlow (FP16), batch size 2, Tesla V100-PCIE-16GB. CPU-Only: Intel Xeon-D 1587 Broadwell-E CPU with Intel DL SDK; score doubled to reflect Intel's stated claim of 2x performance improvement on Skylake with AVX512. Host CPU: E5-2690 v4, 3.5GHz Turbo (Broadwell), HT on.
OpenNMT 692M configuration: inference throughput (sentences/sec). V100 + TensorRT: NVIDIA TensorRT (FP32), batch size 64, Tesla V100-PCIE-16GB. V100 + Torch: Torch (FP32), batch size 4, Tesla V100-PCIE-16GB. CPU-Only: Torch (FP32), batch size 1, E5-2690 v4, 3.5GHz Turbo (Broadwell), HT on.
TENSORRT DEPLOYMENT WORKFLOW

Step 1: Optimize trained model
Trained Neural Network → Import Model → TensorRT Optimizer → Serialize Engine → Optimized Plans (Plan 1, Plan 2, Plan 3)

Step 2: Deploy optimized plans with runtime
Optimized Plans → De-serialize Engine → TensorRT Runtime Engine → Deploy Runtime (Data center, Automotive, Embedded)
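The two steps above map onto a handful of TensorRT Python API calls. Below is a minimal sketch (TensorRT 5-era Python API) that builds a toy network through the Network Definition API, serializes the optimized engine to a plan file, and de-serializes it with the runtime; the layer, shapes, and file name are illustrative assumptions, not part of the slides.

```python
import numpy as np
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# --- Step 1: optimize the trained model and serialize the plan ---------------
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network()

# Toy one-layer network via the Network Definition API, only so the sketch is
# self-contained; a real model would be imported from Caffe/TensorFlow instead.
data = network.add_input("data", trt.float32, (3, 224, 224))
weights = np.zeros((10, 3 * 224 * 224), dtype=np.float32)
bias = np.zeros(10, dtype=np.float32)
fc = network.add_fully_connected(data, 10, weights, bias)
network.mark_output(fc.get_output(0))

builder.max_batch_size = 1
builder.max_workspace_size = 1 << 30          # scratch space for optimization
engine = builder.build_cuda_engine(network)   # TensorRT Optimizer runs here

with open("toy_b1_fp32.plan", "wb") as f:     # hypothetical plan file name
    f.write(engine.serialize())

# --- Step 2: de-serialize the plan with the runtime on the deployment target -
runtime = trt.Runtime(TRT_LOGGER)
with open("toy_b1_fp32.plan", "rb") as f:
    deployed_engine = runtime.deserialize_cuda_engine(f.read())
context = deployed_engine.create_execution_context()   # ready for inference
```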


MODEL IMPORTING

Two paths into TensorRT (both exposed through the Python/C++ API):
• Model Importer — e.g. importing a TensorFlow model
• Network Definition API — for other frameworks, and for AI researchers and data scientists defining networks directly

Runtime inference is then driven through the C++ or Python API.
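As a concrete illustration of the importer path, the sketch below (TensorRT 5-era Python API with the `uff` converter package) turns a frozen TensorFlow graph into a TensorRT network definition; the file name and the input/output tensor names are assumptions for illustration.

```python
import uff
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Convert a frozen TensorFlow graph to UFF, TensorRT's importable format.
# "keras_vgg19.pb" and the tensor names below are illustrative assumptions.
uff_buffer = uff.from_tensorflow_frozen_model(
    "keras_vgg19.pb", output_nodes=["predictions/Softmax"])

builder = trt.Builder(TRT_LOGGER)
network = builder.create_network()

# Parse the UFF buffer into the TensorRT network definition.
parser = trt.UffParser()
parser.register_input("input_1", (3, 224, 224))        # CHW input
parser.register_output("predictions/Softmax")
parser.parse_buffer(uff_buffer, network)
```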
TENSORRT LAYERS

Built-in Layer Support
• Convolution
• LSTM and GRU
• Activation: ReLU, tanh, sigmoid
• Pooling: max and average
• Scaling
• Element-wise operations
• LRN
• Fully-connected
• SoftMax
• Deconvolution

Custom Layer API
[Diagram] A custom layer plugs into the deployed application between the TensorRT Runtime and the CUDA Runtime.
TENSORRT OPTIMIZATIONS

• Layer & Tensor Fusion
• Weights & Activation Precision Calibration
• Kernel Auto-Tuning
• Dynamic Tensor Memory

➢ Optimizations are completely automatic
➢ Performed with a single function call (sketched below)
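In the Python API that single call is the engine build: the builder flags choose precision and workspace, and everything in the list above happens inside it. A minimal sketch (TensorRT 5-era API; the tiny stand-in network exists only so the snippet runs on its own):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

builder = trt.Builder(TRT_LOGGER)
network = builder.create_network()

# Tiny stand-in network so the sketch is runnable; normally this comes from a parser.
data = network.add_input("data", trt.float32, (3, 224, 224))
pool = network.add_pooling(data, trt.PoolingType.MAX, (2, 2))
network.mark_output(pool.get_output(0))

builder.max_batch_size = 8              # optimize for the batch size you will serve
builder.max_workspace_size = 1 << 30    # memory the optimizer may use while tuning
builder.fp16_mode = True                # allow reduced-precision kernels where profitable

# Layer & tensor fusion, precision selection, kernel auto-tuning and dynamic
# tensor memory planning all happen inside this one call.
engine = builder.build_cuda_engine(network)
```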
LAYER & TENSOR FUSION

[Diagram] Un-optimized network (Inception-style module): 1x1, 3x3, 5x5 and 1x1 convolutions, each followed by separate bias and ReLU layers, plus a max pool, with concat layers joining the branches before the next input.
[Diagram] TensorRT optimized network: each convolution + bias + ReLU branch is fused into a single 3x3 CBR, 5x5 CBR or 1x1 CBR kernel alongside the max pool, and the concat layers are eliminated.
LAYER & TENSOR FUSION

• Vertical Fusion
• Horizontal Fusion
• Layer Elimination

Network | Layers before | Layers after
VGG19 | 43 | 27
Inception V3 | 309 | 113
ResNet-152 | 670 | 159
FP16, INT8 PRECISION CALIBRATION

Precision | Dynamic Range | Notes
FP32 | -3.4x10^38 ~ +3.4x10^38 | Training precision; no calibration required
FP16 | -65504 ~ +65504 | Tensor Core; no calibration required
INT8 | -128 ~ +127 | Requires calibration

Precision calibration for INT8 inference:
➢ Minimizes information loss between FP32 and INT8 inference on a calibration dataset
➢ Completely automatic (a calibrator sketch follows)

[Chart] Reduced Precision Inference Performance (ResNet50): Images/Second for CPU-Only (FP32), Tesla P4 (INT8) and Tesla V100 (FP16).

Top-1 accuracy, FP32 vs. INT8:
Network | FP32 Top 1 | INT8 Top 1 | Difference
Googlenet | 68.87% | 68.49% | 0.38%
VGG | 68.56% | 68.45% | 0.11%
Resnet-50 | 73.11% | 72.54% | 0.57%
Resnet-152 | 75.18% | 74.56% | 0.61%
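To use INT8, the builder is given a calibrator that feeds representative batches; TensorRT then derives the scale factors automatically. Below is a minimal sketch of an entropy calibrator, assuming the TensorRT 5-era Python interface (IInt8EntropyCalibrator2) and PyCUDA for device buffers; the data source and cache file name are illustrative assumptions.

```python
import numpy as np
import pycuda.autoinit          # creates a CUDA context for the allocations below
import pycuda.driver as cuda
import tensorrt as trt


class Int8Calibrator(trt.IInt8EntropyCalibrator2):
    """Feeds representative batches so TensorRT can pick INT8 scale factors."""

    def __init__(self, batches, batch_size, cache_file="int8_calib.cache"):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.batches = iter(batches)         # iterable of float32 arrays, shape (N, C, H, W)
        self.batch_size = batch_size
        self.cache_file = cache_file
        self.device_input = None             # allocated lazily from the first batch

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        try:
            data = np.ascontiguousarray(next(self.batches), dtype=np.float32)
        except StopIteration:
            return None                      # calibration data exhausted
        if self.device_input is None:
            self.device_input = cuda.mem_alloc(data.nbytes)
        cuda.memcpy_htod(self.device_input, data)
        return [int(self.device_input)]      # one device pointer per network input

    def read_calibration_cache(self):
        return None                          # no cached scales: calibrate from data

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)


# On the builder (network populated as in the import example):
#   builder.int8_mode = True
#   builder.int8_calibrator = Int8Calibrator(my_batches, batch_size=8)
```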
KERNEL AUTO-TUNING, DYNAMIC TENSOR MEMORY

Kernel Auto-Tuning
• 100s of specialized kernels, optimized for every GPU platform (Tesla V100, Jetson TX2, Drive PX 2)
• Kernel choice depends on multiple parameters: batch size, input dimensions, filter dimensions, ...

Dynamic Tensor Memory
• Reduces memory footprint and improves memory re-use
• Manages memory allocation for each tensor only for the duration of its usage

(A short sketch of how these interact with the builder follows.)
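Neither feature needs explicit code: kernels are benchmarked and selected while the engine is built, and activation memory is planned by the builder. The practical consequences are that a plan is specific to the GPU and batch size it was tuned for, and that the finished engine can report how much device memory its re-use plan needs. A sketch, assuming the TensorRT 5-era Python API exposes the engine's device_memory_size attribute:

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

builder = trt.Builder(TRT_LOGGER)
network = builder.create_network()

# Tiny stand-in network so the sketch runs; a real one comes from the importer.
data = network.add_input("data", trt.float32, (3, 224, 224))
pool = network.add_pooling(data, trt.PoolingType.MAX, (2, 2))
network.mark_output(pool.get_output(0))

builder.max_batch_size = 4             # kernels are auto-tuned for this batch size
builder.max_workspace_size = 1 << 30   # scratch space kernel benchmarking may use
engine = builder.build_cuda_engine(network)

# The plan embeds kernels chosen for *this* GPU, so each target platform
# (Tesla V100, Jetson TX2, Drive PX 2, ...) typically gets its own plan file.
with open("model_b4_v100.plan", "wb") as f:     # hypothetical per-target name
    f.write(engine.serialize())

# Dynamic tensor memory: the builder plans activation re-use, and the engine
# reports how much device memory an execution context will need at runtime.
print("activation memory (bytes):", engine.device_memory_size)
```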
TENSORRT DEPLOYMENT WORKFLOW

Step 1: Optimize trained model
Trained Neural Network → Import Model → TensorRT Optimizer → Serialize Engine → Optimized Plans (Plan 1, Plan 2, Plan 3)

Step 2: Deploy optimized plans with runtime
Optimized Plans → De-serialize Engine → TensorRT Runtime Engine → Deploy Runtime (Data center, Automotive, Embedded)


EXAMPLE: DEPLOYING TENSORFLOW MODELS WITH TENSORRT

Import, optimize and deploy TensorFlow models using the TensorRT Python API.

Steps:
• Start with a frozen TensorFlow model (see the freezing sketch below)
• Create a model parser
• Optimize the model and create a runtime engine
• Perform inference using the optimized runtime engine

[Diagram] Deployment and inference: Trained Neural Network → TensorRT Optimizer → Optimized Runtime Engine; New Data → Runtime Engine → Inference Results
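The example starts from a frozen TensorFlow model. One way to produce such a file from a Keras VGG19 with the TensorFlow 1.x freezing utilities is sketched below; the output file name is an assumption, and the output node name is read from the model rather than hard-coded.

```python
import tensorflow as tf
from tensorflow.keras import backend as K
from tensorflow.keras.applications import VGG19

K.set_learning_phase(0)                     # build the graph in inference mode
model = VGG19(weights="imagenet")
sess = K.get_session()                      # TF 1.x session behind Keras

output_name = model.output.op.name          # e.g. "predictions/Softmax"
frozen_graph = tf.graph_util.convert_variables_to_constants(
    sess, sess.graph.as_graph_def(), [output_name])

# Write the frozen GraphDef to disk; this .pb file is what the parser consumes.
tf.train.write_graph(frozen_graph, ".", "keras_vgg19.pb", as_text=False)
print("input:", model.input.op.name, "output:", output_name)
```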
7 STEPS TO DEPLOYMENT WITH TENSORRT

Step 1: Convert trained model into TensorRT format
Step 2: Create a model parser
Step 3: Register inputs and outputs
Step 4: Optimize model and create a runtime engine
Step 5: Serialize optimized engine
Step 6: De-serialize engine
Step 7: Perform inference

(These steps are sketched in Python below.)
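A compact Python sketch of the seven steps, combining the pieces shown earlier (TensorRT 5-era Python API and the `uff` converter; tensor names and the frozen-graph file name are illustrative assumptions, while the plan file name matches the recap slide):

```python
import uff
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Step 1: convert the trained (frozen) TensorFlow model into TensorRT's UFF format.
uff_buffer = uff.from_tensorflow_frozen_model(
    "keras_vgg19.pb", output_nodes=["predictions/Softmax"])

# Step 2: create a model parser.
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network()
parser = trt.UffParser()

# Step 3: register the network's inputs and outputs with the parser.
parser.register_input("input_1", (3, 224, 224))
parser.register_output("predictions/Softmax")
parser.parse_buffer(uff_buffer, network)

# Step 4: optimize the model and create a runtime engine.
builder.max_batch_size = 1
builder.max_workspace_size = 1 << 30
engine = builder.build_cuda_engine(network)

# Step 5: serialize the optimized engine to a plan file.
with open("keras_vgg19_b1_fp32.engine", "wb") as f:
    f.write(engine.serialize())

# Step 6: de-serialize the engine (typically on the deployment target).
runtime = trt.Runtime(TRT_LOGGER)
with open("keras_vgg19_b1_fp32.engine", "rb") as f:
    deployed_engine = runtime.deserialize_cuda_engine(f.read())

# Step 7: perform inference with an execution context (buffer handling is
# shown in the recap sketch below).
context = deployed_engine.create_execution_context()
```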
RECAP: DEPLOYMENT WORKFLOW

Step 1: Optimize trained model
VGG19 → Import Model → Serialize Engine (FP32, FP16, batch size 1) → Plan file: keras_vgg19_b1_fp32.engine

Step 2: Deploy optimized plan with runtime
Plan file (keras_vgg19_b1_fp32.engine) → De-serialize Engine → TensorRT Runtime Engine → Deploy Runtime → Prediction results on new flower images
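Below is a minimal sketch of the runtime half of this recap, assuming the plan was built for a 3x224x224 input and a 1000-class output: de-serialize keras_vgg19_b1_fp32.engine and classify one image with PyCUDA-managed buffers (the random array stands in for a preprocessed flower image).

```python
import numpy as np
import pycuda.autoinit                 # initializes the CUDA context
import pycuda.driver as cuda
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# De-serialize the plan produced in Step 1 of the recap.
with open("keras_vgg19_b1_fp32.engine", "rb") as f:
    engine = trt.Runtime(TRT_LOGGER).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# One preprocessed flower image; real code would resize to 224x224, apply VGG
# mean subtraction and reorder to CHW, exactly as at training time.
image = np.random.rand(1, 3, 224, 224).astype(np.float32)
output = np.empty((1, 1000), dtype=np.float32)          # 1000 ImageNet classes

# Allocate device buffers and copy the input across.
d_input = cuda.mem_alloc(image.nbytes)
d_output = cuda.mem_alloc(output.nbytes)
cuda.memcpy_htod(d_input, image)

# Run inference: bindings are device pointers in the engine's binding order.
context.execute(batch_size=1, bindings=[int(d_input), int(d_output)])

cuda.memcpy_dtoh(output, d_output)
print("predicted class:", int(np.argmax(output)))
```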
CHALLENGES ADDRESSED BY TENSORRT

High Throughput
  Maximizes inference performance on NVIDIA GPUs
  ➢ INT8 and FP16 precision calibration, layer & tensor fusion, kernel auto-tuning

Low Response Time
  ➢ Up to 40x faster than CPU-only inference and 18x faster inference of TensorFlow models
  ➢ Under 7ms real-time latency

Power and Memory Efficiency
  Performs target-specific optimizations
  ➢ Platform-specific kernels for Embedded (Jetson), Datacenter (Tesla GPUs) and Automotive (DRIVE PX)
  ➢ Dynamic Tensor Memory management improves memory re-use

Deployment-Grade Solution
  Designed for production environments
  ➢ No framework overhead, minimal dependencies
  ➢ Multiple frameworks, Network Definition API
  ➢ C++ and Python APIs, Custom Layer API
TENSORRT PRODUCTION USE CASES
“NVIDIA’s AI platform, using TensorRT software on Tesla GPUs, is the best technology on the
market to support SAP’s requirements for inferencing. TensorRT and NVIDIA GPUs changed
our business model from an offline, next-day service to real-time. We have maximum AI
performance and versatility to meet our customers’ needs, while substantially reducing
energy requirements.”
Source: JUERGEN MUELLER, SAP Chief Innovation Officer

“Real-time execution is very important for self-driving cars. Developing state of the art
perception algorithms normally requires a painful trade-off between speed and accuracy,
but TensorRT brought our ResNet-151 inference time down from 250ms to 89ms.”
Source: Drew Gray – Director of Engineering, UBER ATG

“TensorRT is a real game changer. Not only does TensorRT make model deployment a snap
but the resulting speed up is incredible: out of the box, BodySLAM™, our human pose
estimation engine, now runs over two times faster than using CAFFE GPU inferencing.”
Source: Paul Kruszewski, CEO - WRNCH

TENSORRT KEY TAKEAWAYS

✓ Generate optimized, deployment-ready runtime engines for low-latency inference
✓ Import models trained using Caffe or TensorFlow, or use the Network Definition API
✓ Deploy in FP32 or reduced precision (INT8, FP16) for higher throughput
✓ Optimize frequently used layers and integrate user-defined custom layers
NVIDIA TENSORRT 5 RC NOW AVAILABLE
Volta Tensor Core ⚫ TensorFlow Importer ⚫ Python API

Volta Tensor Core support: 3.7x faster inference on Tesla V100 vs. Tesla P100 under 7ms real-time latency
TensorFlow model import: optimize and deploy TensorFlow models up to 18x faster vs. the TensorFlow framework
Python API: improved productivity with an easy-to-use Python API for data science workflows

Free download for members of the NVIDIA Developer Program: developer.nvidia.com/tensorrt
LEARN MORE

PRODUCT PAGE: developer.nvidia.com/tensorrt
DOCUMENTATION: docs.nvidia.com/deeplearning/sdk
TRAINING: nvidia.com/dli
