DEEP LEARNING DEPLOYMENT WITH NVIDIA TENSORRT
Ashish Sardana | Deep Learning Solutions Architect
AGENDA

Deep Learning in Production
- Current Approaches
- Deployment Challenges

NVIDIA TensorRT
- Programmable Inference Accelerator
- Performance, Optimizations and Features

Example
- Import, Optimize and Deploy TensorFlow Models with TensorRT

Key Takeaways and Additional Resources

Q&A
DEEP LEARNING IN PRODUCTION

Speech Recognition
Recommender Systems
Autonomous Driving
Real-time Object Recognition
Robotics
Real-time Language Translation
Many More…
CURRENT DEPLOYMENT WORKFLOW

TRAINING: Data Management → Training Data → Training → Model Assessment → Trained Neural Network
(Built on CUDA and the NVIDIA Deep Learning SDK: cuDNN, cuBLAS, NCCL)

UNOPTIMIZED DEPLOYMENT — three common options:
1. Deploy the training framework itself
2. Deploy a custom application using the NVIDIA DL SDK
3. Deploy a framework or custom CPU-only application
CHALLENGES WITH CURRENT APPROACHES

High Throughput
  Challenge: Unable to process high-volume, high-velocity data
  ➢ Impact: Increased cost ($, time) per inference

Low Response Time
  Challenge: Applications don't deliver real-time results
  ➢ Impact: Negatively affects user experience (voice recognition, personalized recommendations, real-time object detection)

Power and Memory Efficiency
  Challenge: Inefficient applications
  ➢ Impact: Increased cost (running and cooling), makes deployment infeasible

Deployment-Grade Solution
  Challenge: Research frameworks not designed for production
  ➢ Impact: Framework overhead and dependencies increase time to solution and affect productivity
NVIDIA TENSORRT
Programmable Inference Accelerator

FRAMEWORKS → TensorRT (Optimizer + Runtime) → GPU PLATFORMS:
Tesla P4, Jetson TX2, Drive PX 2, NVIDIA DLA, Tesla V100

developer.nvidia.com/tensorrt
TENSORRT PERFORMANCE

40x Faster CNNs on V100 vs. CPU-Only, Under 7ms Latency (ResNet50)
[Chart] Throughput: 140 images/sec (CPU-Only), 305 images/sec (V100 + TensorFlow), 5,700 images/sec (V100 + TensorRT); latency: 14 ms, 6.67 ms and 6.83 ms respectively.

140x Faster Language Translation RNNs on V100 vs. CPU-Only Inference (OpenNMT)
[Chart] Throughput: 4 sentences/sec (CPU-Only + Torch), 25 sentences/sec (V100 + Torch), 550 sentences/sec (V100 + TensorRT); latency: 280 ms, 153 ms and 117 ms respectively.

ResNet50 configuration: inference throughput (images/sec). V100 + TensorRT: NVIDIA TensorRT (FP16), batch size 39, Tesla V100-SXM2-16GB. V100 + TensorFlow: preview of Volta-optimized TensorFlow (FP16), batch size 2, Tesla V100-PCIE-16GB. CPU-Only: Intel Xeon-D 1587 Broadwell-E CPU with Intel DL SDK; score doubled to reflect Intel's stated claim of 2x performance improvement on Skylake with AVX512. Host CPU: E5-2690 v4, 3.5GHz Turbo (Broadwell), HT on.
OpenNMT 692M configuration: inference throughput (sentences/sec). V100 + TensorRT: NVIDIA TensorRT (FP32), batch size 64, Tesla V100-PCIE-16GB. V100 + Torch: Torch (FP32), batch size 4, Tesla V100-PCIE-16GB. CPU-Only: Torch (FP32), batch size 1, E5-2690 v4, 3.5GHz Turbo (Broadwell), HT on.
TENSORRT DEPLOYMENT WORKFLOW

Step 1: Optimize trained model
Trained Neural Network → Import Model → TensorRT Optimizer → Serialize Engine → Optimized Plans (Plan 1, Plan 2, Plan 3)

Step 2: Deploy optimized plans with runtime
Optimized Plans → De-serialize Engine → TensorRT Runtime Engine → Deploy Runtime (Data center, Automotive, Embedded)
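The two steps above map onto a handful of TensorRT Python API calls. Below is a minimal sketch (TensorRT 5-era Python API) that builds a toy network through the Network Definition API, serializes the optimized engine to a plan file, and de-serializes it with the runtime; the layer, shapes, and file name are illustrative assumptions, not part of the slides.

```python
import numpy as np
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# --- Step 1: optimize the trained model and serialize the plan ---------------
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network()

# Toy one-layer network via the Network Definition API, only so the sketch is
# self-contained; a real model would be imported from Caffe/TensorFlow instead.
data = network.add_input("data", trt.float32, (3, 224, 224))
weights = np.zeros((10, 3 * 224 * 224), dtype=np.float32)
bias = np.zeros(10, dtype=np.float32)
fc = network.add_fully_connected(data, 10, weights, bias)
network.mark_output(fc.get_output(0))

builder.max_batch_size = 1
builder.max_workspace_size = 1 << 30          # scratch space for optimization
engine = builder.build_cuda_engine(network)   # TensorRT Optimizer runs here

with open("toy_b1_fp32.plan", "wb") as f:     # hypothetical plan file name
    f.write(engine.serialize())

# --- Step 2: de-serialize the plan with the runtime on the deployment target -
runtime = trt.Runtime(TRT_LOGGER)
with open("toy_b1_fp32.plan", "rb") as f:
    deployed_engine = runtime.deserialize_cuda_engine(f.read())
context = deployed_engine.create_execution_context()   # ready for inference
```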


MODEL IMPORTING

Two paths into TensorRT (both exposed through the Python/C++ API):
• Model Importer — e.g. importing a TensorFlow model
• Network Definition API — for other frameworks, and for AI researchers and data scientists defining networks directly

Runtime inference is then driven through the C++ or Python API.
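As a concrete illustration of the importer path, the sketch below (TensorRT 5-era Python API with the `uff` converter package) turns a frozen TensorFlow graph into a TensorRT network definition; the file name and the input/output tensor names are assumptions for illustration.

```python
import uff
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Convert a frozen TensorFlow graph to UFF, TensorRT's importable format.
# "keras_vgg19.pb" and the tensor names below are illustrative assumptions.
uff_buffer = uff.from_tensorflow_frozen_model(
    "keras_vgg19.pb", output_nodes=["predictions/Softmax"])

builder = trt.Builder(TRT_LOGGER)
network = builder.create_network()

# Parse the UFF buffer into the TensorRT network definition.
parser = trt.UffParser()
parser.register_input("input_1", (3, 224, 224))        # CHW input
parser.register_output("predictions/Softmax")
parser.parse_buffer(uff_buffer, network)
```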
TENSORRT LAYERS

Built-in Layer Support
• Convolution
• LSTM and GRU
• Activation: ReLU, tanh, sigmoid
• Pooling: max and average
• Scaling
• Element-wise operations
• LRN
• Fully-connected
• SoftMax
• Deconvolution

Custom Layer API
[Diagram] A custom layer plugs into the deployed application between the TensorRT Runtime and the CUDA Runtime.
TENSORRT OPTIMIZATIONS

• Layer & Tensor Fusion
• Weights & Activation Precision Calibration
• Kernel Auto-Tuning
• Dynamic Tensor Memory

➢ Optimizations are completely automatic
➢ Performed with a single function call (sketched below)
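In the Python API that single call is the engine build: the builder flags choose precision and workspace, and everything in the list above happens inside it. A minimal sketch (TensorRT 5-era API; the tiny stand-in network exists only so the snippet runs on its own):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

builder = trt.Builder(TRT_LOGGER)
network = builder.create_network()

# Tiny stand-in network so the sketch is runnable; normally this comes from a parser.
data = network.add_input("data", trt.float32, (3, 224, 224))
pool = network.add_pooling(data, trt.PoolingType.MAX, (2, 2))
network.mark_output(pool.get_output(0))

builder.max_batch_size = 8              # optimize for the batch size you will serve
builder.max_workspace_size = 1 << 30    # memory the optimizer may use while tuning
builder.fp16_mode = True                # allow reduced-precision kernels where profitable

# Layer & tensor fusion, precision selection, kernel auto-tuning and dynamic
# tensor memory planning all happen inside this one call.
engine = builder.build_cuda_engine(network)
```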
LAYER & TENSOR FUSION

[Diagram] Un-optimized network (Inception-style module): 1x1, 3x3, 5x5 and 1x1 convolutions, each followed by separate bias and ReLU layers, plus a max pool, with concat layers joining the branches before the next input.
[Diagram] TensorRT optimized network: each convolution + bias + ReLU branch is fused into a single 3x3 CBR, 5x5 CBR or 1x1 CBR kernel alongside the max pool, and the concat layers are eliminated.
LAYER & TENSOR FUSION

• Vertical Fusion
• Horizontal Fusion
• Layer Elimination

Network | Layers before | Layers after
VGG19 | 43 | 27
Inception V3 | 309 | 113
ResNet-152 | 670 | 159
FP16, INT8 PRECISION CALIBRATION

Precision | Dynamic Range | Notes
FP32 | -3.4x10^38 ~ +3.4x10^38 | Training precision; no calibration required
FP16 | -65504 ~ +65504 | Tensor Core; no calibration required
INT8 | -128 ~ +127 | Requires calibration

Precision calibration for INT8 inference:
➢ Minimizes information loss between FP32 and INT8 inference on a calibration dataset
➢ Completely automatic (a calibrator sketch follows)

[Chart] Reduced Precision Inference Performance (ResNet50): Images/Second for CPU-Only (FP32), Tesla P4 (INT8) and Tesla V100 (FP16).

Top-1 accuracy, FP32 vs. INT8:
Network | FP32 Top 1 | INT8 Top 1 | Difference
Googlenet | 68.87% | 68.49% | 0.38%
VGG | 68.56% | 68.45% | 0.11%
Resnet-50 | 73.11% | 72.54% | 0.57%
Resnet-152 | 75.18% | 74.56% | 0.61%
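To use INT8, the builder is given a calibrator that feeds representative batches; TensorRT then derives the scale factors automatically. Below is a minimal sketch of an entropy calibrator, assuming the TensorRT 5-era Python interface (IInt8EntropyCalibrator2) and PyCUDA for device buffers; the data source and cache file name are illustrative assumptions.

```python
import numpy as np
import pycuda.autoinit          # creates a CUDA context for the allocations below
import pycuda.driver as cuda
import tensorrt as trt


class Int8Calibrator(trt.IInt8EntropyCalibrator2):
    """Feeds representative batches so TensorRT can pick INT8 scale factors."""

    def __init__(self, batches, batch_size, cache_file="int8_calib.cache"):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.batches = iter(batches)         # iterable of float32 arrays, shape (N, C, H, W)
        self.batch_size = batch_size
        self.cache_file = cache_file
        self.device_input = None             # allocated lazily from the first batch

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        try:
            data = np.ascontiguousarray(next(self.batches), dtype=np.float32)
        except StopIteration:
            return None                      # calibration data exhausted
        if self.device_input is None:
            self.device_input = cuda.mem_alloc(data.nbytes)
        cuda.memcpy_htod(self.device_input, data)
        return [int(self.device_input)]      # one device pointer per network input

    def read_calibration_cache(self):
        return None                          # no cached scales: calibrate from data

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)


# On the builder (network populated as in the import example):
#   builder.int8_mode = True
#   builder.int8_calibrator = Int8Calibrator(my_batches, batch_size=8)
```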
KERNEL AUTO-TUNING, DYNAMIC TENSOR MEMORY

Kernel Auto-Tuning
• 100s of specialized kernels, optimized for every GPU platform (Tesla V100, Jetson TX2, Drive PX 2)
• Kernel choice depends on multiple parameters: batch size, input dimensions, filter dimensions, ...

Dynamic Tensor Memory
• Reduces memory footprint and improves memory re-use
• Manages memory allocation for each tensor only for the duration of its usage

(A short sketch of how these interact with the builder follows.)
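Neither feature needs explicit code: kernels are benchmarked and selected while the engine is built, and activation memory is planned by the builder. The practical consequences are that a plan is specific to the GPU and batch size it was tuned for, and that the finished engine can report how much device memory its re-use plan needs. A sketch, assuming the TensorRT 5-era Python API exposes the engine's device_memory_size attribute:

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

builder = trt.Builder(TRT_LOGGER)
network = builder.create_network()

# Tiny stand-in network so the sketch runs; a real one comes from the importer.
data = network.add_input("data", trt.float32, (3, 224, 224))
pool = network.add_pooling(data, trt.PoolingType.MAX, (2, 2))
network.mark_output(pool.get_output(0))

builder.max_batch_size = 4             # kernels are auto-tuned for this batch size
builder.max_workspace_size = 1 << 30   # scratch space kernel benchmarking may use
engine = builder.build_cuda_engine(network)

# The plan embeds kernels chosen for *this* GPU, so each target platform
# (Tesla V100, Jetson TX2, Drive PX 2, ...) typically gets its own plan file.
with open("model_b4_v100.plan", "wb") as f:     # hypothetical per-target name
    f.write(engine.serialize())

# Dynamic tensor memory: the builder plans activation re-use, and the engine
# reports how much device memory an execution context will need at runtime.
print("activation memory (bytes):", engine.device_memory_size)
```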
TENSORRT DEPLOYMENT WORKFLOW

Step 1: Optimize trained model
Trained Neural Network → Import Model → TensorRT Optimizer → Serialize Engine → Optimized Plans (Plan 1, Plan 2, Plan 3)

Step 2: Deploy optimized plans with runtime
Optimized Plans → De-serialize Engine → TensorRT Runtime Engine → Deploy Runtime (Data center, Automotive, Embedded)


EXAMPLE: DEPLOYING TENSORFLOW MODELS WITH TENSORRT

Import, optimize and deploy TensorFlow models using the TensorRT Python API.

Steps:
• Start with a frozen TensorFlow model (see the freezing sketch below)
• Create a model parser
• Optimize the model and create a runtime engine
• Perform inference using the optimized runtime engine

[Diagram] Deployment and inference: Trained Neural Network → TensorRT Optimizer → Optimized Runtime Engine; New Data → Runtime Engine → Inference Results
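The example starts from a frozen TensorFlow model. One way to produce such a file from a Keras VGG19 with the TensorFlow 1.x freezing utilities is sketched below; the output file name is an assumption, and the output node name is read from the model rather than hard-coded.

```python
import tensorflow as tf
from tensorflow.keras import backend as K
from tensorflow.keras.applications import VGG19

K.set_learning_phase(0)                     # build the graph in inference mode
model = VGG19(weights="imagenet")
sess = K.get_session()                      # TF 1.x session behind Keras

output_name = model.output.op.name          # e.g. "predictions/Softmax"
frozen_graph = tf.graph_util.convert_variables_to_constants(
    sess, sess.graph.as_graph_def(), [output_name])

# Write the frozen GraphDef to disk; this .pb file is what the parser consumes.
tf.train.write_graph(frozen_graph, ".", "keras_vgg19.pb", as_text=False)
print("input:", model.input.op.name, "output:", output_name)
```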
7 STEPS TO DEPLOYMENT WITH TENSORRT

Step 1: Convert trained model into TensorRT format
Step 2: Create a model parser
Step 3: Register inputs and outputs
Step 4: Optimize model and create a runtime engine
Step 5: Serialize optimized engine
Step 6: De-serialize engine
Step 7: Perform inference

(These steps are sketched in Python below.)
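A compact Python sketch of the seven steps, combining the pieces shown earlier (TensorRT 5-era Python API and the `uff` converter; tensor names and the frozen-graph file name are illustrative assumptions, while the plan file name matches the recap slide):

```python
import uff
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Step 1: convert the trained (frozen) TensorFlow model into TensorRT's UFF format.
uff_buffer = uff.from_tensorflow_frozen_model(
    "keras_vgg19.pb", output_nodes=["predictions/Softmax"])

# Step 2: create a model parser.
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network()
parser = trt.UffParser()

# Step 3: register the network's inputs and outputs with the parser.
parser.register_input("input_1", (3, 224, 224))
parser.register_output("predictions/Softmax")
parser.parse_buffer(uff_buffer, network)

# Step 4: optimize the model and create a runtime engine.
builder.max_batch_size = 1
builder.max_workspace_size = 1 << 30
engine = builder.build_cuda_engine(network)

# Step 5: serialize the optimized engine to a plan file.
with open("keras_vgg19_b1_fp32.engine", "wb") as f:
    f.write(engine.serialize())

# Step 6: de-serialize the engine (typically on the deployment target).
runtime = trt.Runtime(TRT_LOGGER)
with open("keras_vgg19_b1_fp32.engine", "rb") as f:
    deployed_engine = runtime.deserialize_cuda_engine(f.read())

# Step 7: perform inference with an execution context (buffer handling is
# shown in the recap sketch below).
context = deployed_engine.create_execution_context()
```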
RECAP: DEPLOYMENT WORKFLOW

Step 1: Optimize trained model
VGG19 → Import Model → Serialize Engine (FP32, FP16, batch size 1) → Plan file: keras_vgg19_b1_fp32.engine

Step 2: Deploy optimized plan with runtime
Plan file (keras_vgg19_b1_fp32.engine) → De-serialize Engine → TensorRT Runtime Engine → Deploy Runtime → Prediction results on new flower images
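Below is a minimal sketch of the runtime half of this recap, assuming the plan was built for a 3x224x224 input and a 1000-class output: de-serialize keras_vgg19_b1_fp32.engine and classify one image with PyCUDA-managed buffers (the random array stands in for a preprocessed flower image).

```python
import numpy as np
import pycuda.autoinit                 # initializes the CUDA context
import pycuda.driver as cuda
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# De-serialize the plan produced in Step 1 of the recap.
with open("keras_vgg19_b1_fp32.engine", "rb") as f:
    engine = trt.Runtime(TRT_LOGGER).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# One preprocessed flower image; real code would resize to 224x224, apply VGG
# mean subtraction and reorder to CHW, exactly as at training time.
image = np.random.rand(1, 3, 224, 224).astype(np.float32)
output = np.empty((1, 1000), dtype=np.float32)          # 1000 ImageNet classes

# Allocate device buffers and copy the input across.
d_input = cuda.mem_alloc(image.nbytes)
d_output = cuda.mem_alloc(output.nbytes)
cuda.memcpy_htod(d_input, image)

# Run inference: bindings are device pointers in the engine's binding order.
context.execute(batch_size=1, bindings=[int(d_input), int(d_output)])

cuda.memcpy_dtoh(output, d_output)
print("predicted class:", int(np.argmax(output)))
```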
CHALLENGES ADDRESSED BY TENSORRT

High Throughput
  Maximizes inference performance on NVIDIA GPUs
  ➢ INT8 and FP16 precision calibration, layer & tensor fusion, kernel auto-tuning

Low Response Time
  ➢ Up to 40x faster than CPU-only inference and 18x faster inference of TensorFlow models
  ➢ Under 7ms real-time latency

Power and Memory Efficiency
  Performs target-specific optimizations
  ➢ Platform-specific kernels for Embedded (Jetson), Datacenter (Tesla GPUs) and Automotive (DRIVE PX)
  ➢ Dynamic Tensor Memory management improves memory re-use

Deployment-Grade Solution
  Designed for production environments
  ➢ No framework overhead, minimal dependencies
  ➢ Multiple frameworks, Network Definition API
  ➢ C++ and Python APIs, Custom Layer API
TENSORRT PRODUCTION USE CASES
“NVIDIA’s AI platform, using TensorRT software on Tesla GPUs, is the best technology on the
market to support SAP’s requirements for inferencing. TensorRT and NVIDIA GPUs changed
our business model from an offline, next-day service to real-time. We have maximum AI
performance and versatility to meet our customers’ needs, while substantially reducing
energy requirements.”
Source: JUERGEN MUELLER, SAP Chief Innovation Officer

“Real-time execution is very important for self-driving cars. Developing state of the art
perception algorithms normally requires a painful trade-off between speed and accuracy,
but TensorRT brought our ResNet-151 inference time down from 250ms to 89ms.”
Source: Drew Gray – Director of Engineering, UBER ATG

“TensorRT is a real game changer. Not only does TensorRT make model deployment a snap
but the resulting speed up is incredible: out of the box, BodySLAM™, our human pose
estimation engine, now runs over two times faster than using CAFFE GPU inferencing.”
Source: Paul Kruszewski, CEO - WRNCH

TENSORRT KEY TAKEAWAYS

✓ Generate optimized, deployment-ready runtime engines for low-latency inference
✓ Import models trained using Caffe or TensorFlow, or use the Network Definition API
✓ Deploy in FP32 or reduced precision (INT8, FP16) for higher throughput
✓ Optimize frequently used layers and integrate user-defined custom layers
NVIDIA TENSORRT 5 RC NOW AVAILABLE
Volta Tensor Core ⚫ TensorFlow Importer ⚫ Python API

Volta Tensor Core support: 3.7x faster inference on Tesla V100 vs. Tesla P100 under 7ms real-time latency
TensorFlow model import: optimize and deploy TensorFlow models up to 18x faster vs. the TensorFlow framework
Python API: improved productivity with an easy-to-use Python API for data science workflows

Free download for members of the NVIDIA Developer Program: developer.nvidia.com/tensorrt
LEARN MORE

PRODUCT PAGE: developer.nvidia.com/tensorrt
DOCUMENTATION: docs.nvidia.com/deeplearning/sdk
TRAINING: nvidia.com/dli
