TensorRT Release Notes
Chapter 1.
TENSORRT OVERVIEW
The core of NVIDIA TensorRT is a C++ library that facilitates high performance inference
on NVIDIA graphics processing units (GPUs). TensorRT takes a trained network, which
consists of a network definition and a set of trained parameters, and produces a highly
optimized runtime engine which performs inference for that network.
TensorRT provides APIs in C++ and Python that let you express deep learning models
through the Network Definition API, or load a pre-defined model through one of the
parsers, so that TensorRT can optimize and run the model on an NVIDIA GPU. TensorRT
applies graph optimizations and layer fusion, among other optimizations, while also
finding the fastest implementation of that model by leveraging a diverse collection of
highly optimized kernels. TensorRT also supplies a runtime that you can use to execute
this network on all of NVIDIA's GPUs from the Kepler generation onward.
TensorRT also includes optional high-speed mixed-precision capabilities, introduced with
the Tegra X1 and extended with the Pascal, Volta, and Turing architectures.
Chapter 2.
TENSORRT RELEASE 6.X.X
‣ New layers:
IResizeLayer
The IResizeLayer implements the resize operation on an input tensor. For more
information, see IResizeLayer: TensorRT API and IResizeLayer: TensorRT
Developer Guide.
IShapeLayer
The IShapeLayer gets the shape of a tensor. For more information, see
IShapeLayer: TensorRT API and IShapeLayer: TensorRT Developer Guide.
PointWise fusion
Multiple adjacent pointwise layers can be fused into a single pointwise layer to
improve performance. For more information, see the TensorRT Best Practices
Guide.
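As a minimal C++ sketch of wiring these new layers into a network (not taken from the samples; the network, input tensor, and scale factors are illustrative assumptions):

#include "NvInfer.h"
using namespace nvinfer1;

// Sketch only: assumes "network" was created with the explicit-batch flag and
// "input" is a 4D NCHW tensor; names and scale factors are illustrative.
void addResizeAndShape(INetworkDefinition* network, ITensor* input)
{
    // IResizeLayer: 2x nearest-neighbor upscaling of the spatial dimensions.
    IResizeLayer* resize = network->addResize(*input);
    const float scales[] = {1.f, 1.f, 2.f, 2.f}; // per-dimension scale factors (N, C, H, W)
    resize->setScales(scales, 4);
    resize->setResizeMode(ResizeMode::kNEAREST);

    // IShapeLayer: emits a 1D INT32 tensor holding the shape of its input,
    // which can feed other shape-computation layers.
    IShapeLayer* shape = network->addShape(*resize->getOutput(0));
    shape->getOutput(0)->setName("resized_shape");
}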
‣ New operators:
3-dimensional convolution
Performs a convolution operation with 3D filters on a 5D tensor. For
more information, see addConvolutionNd in the TensorRT API and
IConvolutionLayer in the TensorRT Developer Guide.
3-dimensional deconvolution
Performs a deconvolution operation with 3D filters on a 5D tensor. For
more information, see addDeconvolutionNd in the TensorRT API and
IDeconvolutionLayer in the TensorRT Developer Guide.
3-dimensional pooling
Performs a pooling operation with a 3D sliding window on a 5D tensor. For more
information, see addPoolingNd in the TensorRT API and IPoolingLayer in the
TensorRT Developer Guide.
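A brief sketch of the Nd variants on a 5D (N, C, D, H, W) input; the channel count, kernel sizes, and weight objects below are placeholders rather than values from these notes:

#include "NvInfer.h"
using namespace nvinfer1;

// Sketch only: "convWeights" and "convBias" are assumed to be filled in elsewhere.
void add3DConvAndPool(INetworkDefinition* network, ITensor* input,
                      Weights convWeights, Weights convBias)
{
    // 3D convolution: 16 output feature maps with a 3x3x3 kernel.
    IConvolutionLayer* conv =
        network->addConvolutionNd(*input, 16, Dims3{3, 3, 3}, convWeights, convBias);
    conv->setStrideNd(Dims3{1, 1, 1});
    conv->setPaddingNd(Dims3{1, 1, 1});

    // 3D max pooling with a 2x2x2 sliding window.
    IPoolingLayer* pool =
        network->addPoolingNd(*conv->getOutput(0), PoolingType::kMAX, Dims3{2, 2, 2});
    pool->setStrideNd(Dims3{2, 2, 2});
}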
‣ New plugins: Added a persistent LSTM plugin; a half-precision persistent LSTM
plugin that supports variable sequence lengths. The plugin also supports
bidirectional LSTMs, setting initial hidden/cell values, storing final hidden/cell
values, and multiple layers. It is used through the IPluginV2 interface, achieves
better performance with small batch sizes, and is currently supported only on Linux.
For more information, see Persistent LSTM Plugin in the TensorRT Developer Guide.
‣ New operators:
TensorFlow
Added ResizeBilinear and ResizeNearest ops.
ONNX
Added Resize op.
For more information, see the full list of Supported Ops in the Support Matrix.
‣ New samples:
sampleDynamicReshape
Added sampleDynamicReshape which demonstrates how to use dynamic input
dimensions in TensorRT by creating an engine for resizing dynamically shaped
inputs to the correct size for an ONNX MNIST model. For more information,
see Working With Dynamic Shapes in the TensorRT Developer Guide, Digit
Recognition With Dynamic Shapes in the TensorRT Samples Support Guide and
the GitHub: ../samples/opensource/sampleDynamicReshape directory.
sampleNvmedia
Added sampleNvmedia which demonstrates how to run NvMedia DLA safe
flows by constructing a network with an Elementwise layer, building a NvMedia
DLA safe engine, and performing inference using safety certified NvMedia DLA
APIs. For more information about NvMedia DLA APIs, see the Developer Guide
of PDK 5.1.9.0 and the GitHub: ../samples/opensource/sampleNvmedia directory.
For more details about the Elementwise layer, see IElementWiseLayer in the
TensorRT API.
sampleReformatFreeIO
Added sampleReformatFreeIO which uses a Caffe model that was trained on
the MNIST dataset and performs engine building and inference using TensorRT.
Specifically, it shows how to use the reformat-free I/O tensor APIs to explicitly
specify I/O formats as TensorFormat::kLINEAR, TensorFormat::kCHW2 and
TensorFormat::kHWC8 for Float16 and INT8 precision (a minimal API sketch
follows this list of samples). For more information, see Specifying I/O Formats
Using The Reformat Free I/O Tensors APIs in the TensorRT Samples Support
Guide and the GitHub: ../samples/opensource/sampleReformatFreeIO directory.
sampleUffPluginV2Ext
Added sampleUffPluginV2Ext which implements the custom pooling layer
for the MNIST model (data/samples/lenet5_custom_pool.uff) and
demonstrates how to extend INT8 I/O for a plugin. For more information, see
Adding A Custom Layer That Supports INT8 I/O To Your Network In TensorRT
in the TensorRT Samples Support Guide and the GitHub: ../samples/opensource/
sampleUffPluginV2Ext directory.
sampleNMT
Added sampleNMT which demonstrates the implementation of Neural Machine
Translation (NMT) based on a TensorFlow seq2seq model using the TensorRT
API. The TensorFlow seq2seq model is an open-source NMT project that uses
deep neural networks to translate text from one language to another.
For more information, see Neural Machine Translation (NMT) Using A Sequence
To Sequence (seq2seq) Model in the TensorRT Samples Support Guide and
Importing A Model Using The C++ API For Safety in the TensorRT Developer
Guide and the GitHub: ../samples/opensource/sampleNMT directory.
sampleUffMaskRCNN
This sample, sampleUffMaskRCNN, performs inference on the Mask R-CNN
network in TensorRT. Mask R-CNN is based on the Mask R-CNN paper which
performs the task of object detection and object mask predictions on a target
image. This sample's model is based on the Keras implementation of Mask R-CNN,
and its training framework can be found in the Mask R-CNN GitHub
repository. For more information, see sampleUffMaskRCNN in the TensorRT
Sample Support Guide.
sampleUffFasterRCNN
This sample, sampleUffFasterRCNN, is a UFF TensorRT sample for Faster-RCNN
in the NVIDIA Transfer Learning Toolkit SDK. This sample serves as a demo of
how to use a pretrained Faster-RCNN model from the Transfer Learning Toolkit to
do inference with TensorRT. For more information, see sampleUffFasterRCNN in the
TensorRT Sample Support Guide.
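At the API level, the reformat-free I/O usage that sampleReformatFreeIO demonstrates reduces to setting the type and allowed formats of the network I/O tensors; a minimal sketch (the tensor indices, format, and precision choices are illustrative):

#include "NvInfer.h"
using namespace nvinfer1;

// Sketch only: request half-precision kHWC8 I/O so TensorRT does not insert
// reformatting at the network boundary. Assumes "network" and "config" exist.
void useReformatFreeIO(INetworkDefinition* network, IBuilderConfig* config)
{
    ITensor* input = network->getInput(0);
    ITensor* output = network->getOutput(0);

    input->setType(DataType::kHALF);
    input->setAllowedFormats(1U << static_cast<int>(TensorFormat::kHWC8));

    output->setType(DataType::kHALF);
    output->setAllowedFormats(1U << static_cast<int>(TensorFormat::kHWC8));

    // FP16 must be enabled for half-precision I/O formats to be usable.
    config->setFlag(BuilderFlag::kFP16);
}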
‣ New optimizations:
Dynamic shapes
The size of a tensor can vary at runtime. IShuffleLayer, ISliceLayer, and the new
IResizeLayer now have optional inputs that can specify runtime dimensions.
IShapeLayer can get the dimensions of tensors at runtime, and some layers
can compute new dimensions; a build-time sketch appears after this list of
optimizations. For more information, see Working With Dynamic Shapes and
TensorRT Layers in the TensorRT Developer Guide, Digit Recognition With
Dynamic Shapes in the TensorRT Samples Support Guide and the GitHub:
../samples/opensource/sampleDynamicReshape directory.
Reformat free I/O
Network I/O tensors are no longer restricted to linear FP32; the formats of network
I/O tensors can now be specified explicitly through the API. Removing this
reformatting benefits many applications and, in particular, saves considerable
memory-traffic time. For more information, see Working With Reformat-Free
Network I/O Tensors and Example 4: Add A Custom Layer With INT8 I/O Support
Using C++ in the TensorRT Developer Guide.
Layer optimizations
Shuffle operations that are equivalent to identity operations on the underlying
data are omitted if the input tensor is only used in the shuffle layer and the
input and output tensors of this layer are not input and output tensors of the
network. TensorRT no longer executes additional kernels or memory copies for
such operations. For more information, see How Does TensorRT Work in the
TensorRT Developer Guide.
New INT8 calibrator
MinMaxCalibrator is the preferred calibrator for NLP tasks. It supports per-
activation-tensor scaling and computes scales using the per-tensor absolute
maximum value. For more information, see INT8 Calibration Using C++.
Explicit precision
You can manually configure a network to be an explicit precision network in
TensorRT. This feature enables users to import pre-quantized models with
explicit quantizing and dequantizing scale layers into TensorRT. Setting the
network to be an explicit precision network implies that you will set the precision
of all the network input tensors and layer output tensors in the network.
TensorRT will not quantize the weights of any layer (including those running in
lower precision). Instead, weights will simply be cast into the required precision.
For more information about explicit precision, see Working With Explicit
Precision Using C++ and Working With Explicit Precision Using Python in the
TensorRT Developer Guide.
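To illustrate the dynamic shapes workflow described above, the following sketch builds an engine whose input height and width are chosen at runtime; the input name, dimension ranges, and trivial network body are placeholders, not values from these notes:

#include "NvInfer.h"
using namespace nvinfer1;

// Sketch only: build an engine with runtime-sized input dimensions by using an
// optimization profile. "logger" is assumed to be an existing ILogger.
ICudaEngine* buildDynamicEngine(ILogger& logger)
{
    IBuilder* builder = createInferBuilder(logger);

    // Dynamic shapes require an explicit-batch network.
    const auto flags = 1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    INetworkDefinition* network = builder->createNetworkV2(flags);

    // -1 marks dimensions that are supplied at runtime.
    ITensor* input = network->addInput("input", DataType::kFLOAT, Dims4{1, 1, -1, -1});

    // A trivial identity layer stands in for the real network body.
    IIdentityLayer* identity = network->addIdentity(*input);
    network->markOutput(*identity->getOutput(0));

    // The optimization profile bounds each runtime dimension.
    IBuilderConfig* config = builder->createBuilderConfig();
    IOptimizationProfile* profile = builder->createOptimizationProfile();
    profile->setDimensions("input", OptProfileSelector::kMIN, Dims4{1, 1, 28, 28});
    profile->setDimensions("input", OptProfileSelector::kOPT, Dims4{1, 1, 28, 28});
    profile->setDimensions("input", OptProfileSelector::kMAX, Dims4{1, 1, 56, 56});
    config->addOptimizationProfile(profile);

    return builder->buildEngineWithConfig(*network, *config);
}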
‣ Installation:
‣ Added support for RPM and Debian packages for PowerPC users.
Compatibility
‣ TensorRT 6.0.1 has been tested with the following:
‣ cuDNN 7.6.3
‣ TensorFlow 1.14.0
‣ PyTorch 1.1.0
‣ ONNX 1.5.0
‣ This TensorRT release supports CUDA 9.0, 10.0, and 10.1 update 1.
Limitations
‣ Upgrading TensorRT to the latest version is only supported when the currently
installed TensorRT version is equal to or newer than the last two public releases. For
example, TensorRT 6.x.x supports upgrading from TensorRT 5.0.x and TensorRT
5.1.x.
‣ Calibration for a network with INT8 I/O tensors requires FP32 calibration data.
Deprecated Features
The following features are deprecated in TensorRT 6.0.1:
Samples changes
‣ The PGM files for the MNIST samples have been removed. A script, called
generate_pgms.py, has been provided in the samples/mnist/data directory
to generate the images using the dataset.
‣ --useDLACore=0 is no longer a valid option for sampleCharRNN as the DLA
does not support FP32 or RNNs, and the sample is only written to work with
FP32 in all cases.
Fixed Issues
‣ Logging level Severity::kVERBOSE is now fully supported. Log messages with
this level of severity are verbose messages with debugging information.
‣ Deconvolution layer with stride > 32 is now supported on DLA.
‣ Deconvolution layer with kernel size > 32 is now supported on DLA.
Known Issues
‣ For Ubuntu 14.04 and CentOS7, in order for ONNX, TensorFlow and TensorRT to
co-exist in the same environment, ONNX and TensorFlow must be built from source
using your system's native compilers. It’s especially important to build ONNX and
TensorFlow from source when using the IBM Anaconda channel for PowerPC to
avoid compatibility issues with pybind11 and protobuf.
Chapter 3.
TENSORRT RELEASE 5.X.X
Compatibility
‣ TensorRT 5.1.5 has been tested with the following:
‣ cuDNN 7.5.0
‣ TensorFlow 1.12.0
‣ PyTorch 1.0
‣ ONNX 1.4.1
‣ This TensorRT release supports CUDA 9.0, CUDA 10.0, and CUDA 10.1.
Deprecated Features
The following features are deprecated in TensorRT 5.1.5:
Known Issues
‣ For Ubuntu 14.04 and CentOS7, there is a known bug when trying to import
TensorRT and ONNX Python modules together due to different compiler versions
used to generate their respective Python bindings. As a work around, build the
ONNX module from source using your system's native compilers.
‣ You may see the following warning when running programs linked with TensorRT
5.1.5 and CUDA 10.1 libraries:
[W] [TRT] TensorRT was compiled against cuBLAS 10.2.0 but is linked against
cuBLAS 10.1.0.
You can resolve this by updating your CUDA 10.1 installation to 10.1 update 1.
‣ There is a known issue in sample yolov3_onnx with ONNX versions > 1.4.1. To work
around this, install version 1.4.1 of ONNX through:
Asymmetric padding
Compatibility
‣ TensorRT 5.1.3 has been tested with the following:
‣ cuDNN 7.5.0
‣ TensorFlow 1.12.0
‣ PyTorch 1.0
‣ ONNX 1.4.1
‣ TensorRT will now emit a warning when the major, minor, and patch versions
of cuDNN and cuBLAS do not match the major, minor, and patch versions that
TensorRT is expecting.
Limitations
‣ For CentOS and RHEL users, when choosing Python 3:
‣ Only Python version 3.6 from EPEL is supported by the RPM installation.
‣ Only Python versions 3.4 and 3.6 from EPEL are supported by the tar
installation.
‣ In order to run the UFF converter and its related C++ and Python samples on
PowerPC, it’s necessary to install TensorFlow for PowerPC. For more information,
see Install TensorFlow on Power systems.
‣ In order to run the PyTorch samples on PowerPC, it’s necessary to install PyTorch
specifically built for PowerPC, which is not available from PyPi. For more
information, see Install PyTorch on Power systems.
Deprecated Features
The following features are deprecated in TensorRT 5.1.3:
‣ sampleNMT has been removed from the TensorRT package. The public data source
files have changed and no longer work with the sample.
Fixed Issues
The following issues have been resolved in TensorRT 5.1.3:
‣ Fixed the behavior of the Caffe crop layer when the layer has an asymmetric crop
offset.
‣ ITensor::getType() and ILayer::getOutputType() now report the type
correctly. Previously, both types reported DataType::kFLOAT even if the output
type should have been DataType::kINT32. For example, the output type of
IConstantLayer with DataType::kINT32 weights is now correctly reported as
DataType::kINT32. The affected layers include:
Known Issues
‣ When running ShuffleNet with small batch sizes between 1 and 4, you may
encounter performance regressions of up to 15% compared to TensorRT 5.0.
‣ When running ResNeXt101 with a batch size of 4 using INT8 precision on a Volta
GPU, you may encounter intermittent performance regressions of up to 10%
compared to TensorRT 5.0. Rebuilding the engine may resolve this issue.
‣ There is a known issue in sample yolov3_onnx with ONNX versions > 1.4.1. To work
around this, install version 1.4.1 of ONNX through:
NVTX support
NVIDIA Tools Extension SDK (NVTX) is a C-based API for marking events and
ranges in your applications. NVTX annotations were added in TensorRT to help
correlate the runtime engine layer execution with CUDA kernel calls. NVIDIA
Nsight Systems supports collecting and visualizing these events and ranges on the
timeline. NVIDIA Nsight Compute also supports collecting and displaying the state
of all active NVTX domains and ranges in a given thread when the application is
suspended.
New layer
Added support for the Slice layer. The Slice layer implements a slice operator for
tensors. For more information, see ISliceLayer.
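A minimal sketch of adding the new layer through the C++ API; the start, size, and stride values are illustrative:

#include "NvInfer.h"
using namespace nvinfer1;

// Sketch only: crop a 16x16 window from the top-left corner of a CHW tensor
// in an implicit-batch network.
void addCrop(INetworkDefinition* network, ITensor* input)
{
    ISliceLayer* slice = network->addSlice(*input,
                                           Dims3{0, 0, 0},   // start (C, H, W)
                                           Dims3{1, 16, 16}, // size
                                           Dims3{1, 1, 1});  // stride
    slice->getOutput(0)->setName("cropped");
}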
RNNs
Changed RNNv1 and RNNv2 validation of hidden and cell input/output dimensions.
This affects only bidirectional RNNs.
EntropyCalibrator2
Added the Entropy Calibration 2 algorithm, which is the preferred calibrator.
Python support
Python 3 is now supported for CentOS and RHEL users. The Python 3 wheel files
have been split so that each wheel file now contains the Python bindings for only one
Python version and follows pip naming conventions.
New Python samples
Compatibility
‣ TensorRT 5.1.2 RC has been tested with the following:
‣ cuDNN 7.5.0
‣ TensorFlow 1.12.0
‣ PyTorch 1.0
‣ This TensorRT release supports CUDA 9.0, CUDA 10.0 and CUDA 10.1.
Limitations
‣ A few optimizations are disabled when building refittable engines:
‣ IScaleLayer operations that have a non-zero count of weights for shift or scale
and are mathematically the identity function will not be removed, since a refit of
the shift or scale weights could make the layer a non-identity function. IScaleLayer
operations where the shift and scale weights have zero count are still removed if
the power weights are unity.
‣ Optimizations for multilayer perceptrons are disabled. These
optimizations target serial compositions of IFullyConnectedLayer,
IMatrixMultiplyLayer, and IActivationLayer.
Deprecated Features
The following features are deprecated in TensorRT 5.1.2 RC:
‣ The UFF Parser which is used to parse a network in UFF format will be deprecated
in a future release. The recommended method of importing TensorFlow models
to TensorRT is using TensorFlow with TensorRT (TF-TRT). For step-by-step
instructions on how to accelerate inference in TF-TRT, see the TF-TRT User Guide
and Release Notes. For source code from GitHub, see Examples for TensorRT in
TensorFlow (TF-TRT).
Known Issues
‣ Using the current public data sources, sampleNMT produces incorrect results, which
result in a low BLEU score. This sample will be removed in the next release so that
we can update the source code to work with the latest public data.
‣ Python sample yolov3_onnx is functional only for ONNX versions greater than 1.1.0
and less than 1.4.0.
‣ CUDA 10.1 is now supported. For more information, see the CUDA 10.1 Release
Notes.
Compatibility
‣ TensorRT 5.1.1 RC has been tested with the following:
‣ cuDNN 7.5.0
Limitations
‣ The Python API is not included in this package.
Known Issues
‣ When linking against CUDA 10.1, performance regressions may occur under Drive
5.0 QNX and Drive 5.0 Linux because of a regression in cuBLAS. This affects the
FullyConnected layers in AlexNet, VGG19, and ResNet-50 for small batch sizes
(between 1 and 4).
Cos, Tan, Asin, Acos, Atan, Sinh, Cosh, Asinh, Acosh, Atanh, Ceil,
Floor, ScaledTanh, Softsign, Slice, ThresholdedRelu and Unsqueeze
ops.
For more information, see the TensorRT Support Matrix.
NVTX support
NVIDIA Tools Extension SDK (NVTX) is a C-based API for marking events and
ranges in your applications. NVTX annotations were added in TensorRT to help
correlate the runtime engine layer execution with CUDA kernel calls. NVIDIA
Nsight Systems supports collecting and visualizing these events and ranges on the
timeline. NVIDIA Nsight Compute also supports collecting and displaying the state
of all active NVTX domains and ranges in a given thread when the application is
suspended.
New layer
Added support for the Slice layer. The Slice layer implements a slice operator for
tensors. For more information, see ISliceLayer.
RNNs
Changed RNNv1 and RNNv2 validation of hidden and cell input/output dimensions.
This affects only bidirectional RNNs.
EntropyCalibrator2
Added the Entropy Calibration 2 algorithm, which is the preferred calibrator. This is
also the required calibrator for DLA INT8 because it supports per-activation-tensor
scaling.
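A skeleton of a calibrator built on the new interface; how calibration batches are loaded and copied to the device is application specific and only stubbed out here:

#include "NvInfer.h"
#include <cstddef>
#include <vector>

// Skeleton IInt8EntropyCalibrator2 implementation; batch handling is stubbed.
class EntropyCalibrator : public nvinfer1::IInt8EntropyCalibrator2
{
public:
    EntropyCalibrator(int batchSize, void* deviceInput)
        : mBatchSize(batchSize), mDeviceInput(deviceInput) {}

    int getBatchSize() const override { return mBatchSize; }

    bool getBatch(void* bindings[], const char* names[], int nbBindings) override
    {
        // Copy the next calibration batch into mDeviceInput here and return
        // true; return false once the calibration set is exhausted.
        bindings[0] = mDeviceInput;
        return false;
    }

    const void* readCalibrationCache(std::size_t& length) override
    {
        length = mCache.size();
        return mCache.empty() ? nullptr : mCache.data();
    }

    void writeCalibrationCache(const void* cache, std::size_t length) override
    {
        mCache.assign(static_cast<const char*>(cache),
                      static_cast<const char*>(cache) + length);
    }

private:
    int mBatchSize;
    void* mDeviceInput;
    std::vector<char> mCache;
};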
ILogger
Added a verbose severity level in ILogger for emitting debugging messages. Some
messages that were previously logged with severity level kINFO are now logged
with severity level kVERBOSE. Added a new ILogger-derived class in the samples
and trtexec. Most messages should be categorized (using the severity level) as:
[V]
For verbose debug informational messages.
[I]
For "instructional" informational messages.
[W]
For warning messages.
[E]
For error messages.
[F]
For fatal error messages.
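A minimal ILogger implementation following this convention might look like the sketch below; it is illustrative and not the actual logger shipped with the samples:

#include "NvInfer.h"
#include <iostream>

// Sketch only: print the severity tags described above to stdout.
class ConsoleLogger : public nvinfer1::ILogger
{
public:
    void log(Severity severity, const char* msg) override
    {
        const char* tag = "[?] ";
        switch (severity)
        {
        case Severity::kVERBOSE:        tag = "[V] "; break;
        case Severity::kINFO:           tag = "[I] "; break;
        case Severity::kWARNING:        tag = "[W] "; break;
        case Severity::kERROR:          tag = "[E] "; break;
        case Severity::kINTERNAL_ERROR: tag = "[F] "; break;
        }
        std::cout << tag << msg << std::endl;
    }
};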
Python
Python bindings
Added Python bindings to the aarch64-gnu release package (debian and tar).
RPM installation
Provided installation support for Red Hat Enterprise Linux (RHEL) and CentOS users
to upgrade from TensorRT 5.0.x to TensorRT 5.1.x. For more information, see the
upgrading instructions in the Installation Guide.
Compatibility
‣ TensorRT 5.1.0 RC has been tested with cuDNN 7.3.1.
Limitations
‣ A few optimizations are disabled when building refittable engines.
‣ IScaleLayer operations that have a non-zero count of weights for shift or scale
and are mathematically the identity function will not be removed, since a refit of
the shift or scale weights could make the layer a non-identity function. IScaleLayer
operations where the shift and scale weights have zero count are still removed if
the power weights are unity.
‣ Optimizations for multilayer perceptrons are disabled. These
optimizations target serial compositions of IFullyConnectedLayer,
IMatrixMultiplyLayer, and IActivationLayer.
‣ DLA limitations
‣ local_size = 5
‣ alpha = 0.0001
‣ beta = 0.75
‣ INT8 LRN, Sigmoid, and Tanh are not supported.
For more information, see DLA Supported Layers.
Deprecated Features
The following features are deprecated in TensorRT 5.1.0 RC:
Known Issues
‣ When the tensor size is too large, such as a single tensor that has more than
4G elements, overflow may occur which will cause TensorRT to crash. As a
workaround, you may need to reduce the batch size.
‣ Python support for AArch64 Linux is included as an early access release. All
features are expected to be available, however, some aspects of functionality and
performance will likely be limited compared to a non-EA release.
‣ The UFF parser’s memory usage was significantly reduced to better accommodate
boards with small amounts of memory.
Compatibility
‣ TensorRT 5.0.6 has been tested with the following:
‣ cuDNN 7.3.1
‣ TensorFlow 1.12
‣ PyTorch 1.0
‣ This TensorRT release supports CUDA 10.0.
Known Issues
‣ The default workspace size for sampleUffSSD is 1 GB. This may be too large for the
Jetson TX1 NANO, therefore, change the workspace for the builder in the source file
via the following code:
builder->setMaxWorkspaceSize(16_MB);
‣ In order to run larger networks or larger batch sizes with TensorRT, it may be
necessary to free memory on the board. This can be accomplished by running in
headless mode or killing processes with high memory consumption.
‣ Due to limited system memory on the Jetson TX1 NANO, which is shared
between the CPU and GPU, you may not be able to run some samples, for example,
sampleFasterRCNN.
‣ Python sample yolov3_onnx is functional only for ONNX versions greater than
1.1.0 and less than 1.4.0.
Compatibility
‣ TensorRT 5.0.5 supports CUDA 10.0
‣ TensorRT 5.0.5 supports cuDNN 7.3.1
‣ TensorRT 5.0.5 supports the Android platform with API level 26 or higher
Limitations In 5.0.5
‣ TensorRT 5.0.5.1 supports DLA while TensorRT 5.0.5.0 does not.
Known Issues
‣ For TensorRT 5.0.5.0, some sample programs have --useDLACore in their command
line arguments, however, do not use it because this release does not support DLA.
‣ When running trtexec from a saved engine, the --output and --input
command line arguments are mandatory. For example:
‣ When running applications that use DLA on Xavier based platforms that also
contain a discrete GPU (dGPU), you may be required to select the integrated GPU
(iGPU). This can be done using the following command:
export CUDA_VISIBLE_DEVICES=1
‣ Two new samples showcasing ONNX model parsing functionality have been added:
‣ sampleOnnxMNIST
‣ sampleINT8API
Compatibility
‣ TensorRT 5.0.4 supports Windows 10
‣ TensorRT 5.0.4 supports CUDA 10.0 and CUDA 9.0
‣ TensorRT 5.0.4 supports cuDNN 7.3.1
‣ TensorRT 5.0.4 supports Visual Studio 2017
Limitations In 5.0.4
‣ TensorRT 5.0.4 does not support Python API on Windows.
Known Issues
‣ NVIDIA’s Windows display driver sets timeout detection recovery to 2 seconds by
default. This can cause some timeouts within TensorRT’s builder and cause crashes.
For more information, see Timeout Detection & Recovery (TDR) to increase the
default timeout threshold if you encounter this problem.
‣ TensorRT Windows performance is slower than Linux due to the operating system
and driver differences. There are two driver modes:
‣ Most README files that are included with the samples assume that you
are working on a Linux workstation. If you are using Windows and do not
have access to a Linux system with an NVIDIA GPU, then you can try using
VirtualBox to create a virtual machine based on Ubuntu. You may also want to
consider using a Docker container for Ubuntu. Many samples do not require
any training, therefore the CPU versions of TensorFlow and PyTorch are enough
to complete the samples.
work around this, open list.txt in a text editor and ensure that the file is
using Unix-style line endings.
‣ For this TensorRT release, JetPack L4T and Drive D5L are supported by a single
package.
See the TensorRT Developer Guide for details.
Compatibility
TensorRT 5.0.3 supports the following product versions:
‣ CUDA 10.0
‣ cuDNN 7.3.1
‣ NvMedia DLA version 2.2
‣ NvMedia VPI Version 2.3
Known Issues
‣ For multi-process execution, and specifically when multiple inference sessions
executing in parallel (for example, multiple instances of trtexec) target different accelerators, you may
observe a performance degradation if cudaEventBlockingSync is used for stream
synchronization.
One way to work around this performance degradation is to use the
cudaEventDefault flag when creating the events which internally uses the spin-
wait synchronization mechanism. In trtexec, the default behavior is to use blocking
events, but this can be overridden with the --useSpinWait option to specify spin-
wait based synchronization.
For more information about CUDA blocking sync semantics, refer to Event
Management.
‣ There is a known issue when attempting to cross compile samples for mobile
platforms on an x86_64 host machine. As cross-platform CUDA packages
are structured differently, the following changes are required for samples/
Makefile.config when compiling cross platform.
Line 80
Add:
-L"$(CUDA_INSTALL_DIR)/targets/$(TRIPLE)/$(CUDA_LIBDIR)/stubs"
Line 109
Remove:
-lnvToolsExt
Turing
You must use CUDA 10.0 or later if you are using a Turing GPU.
name in the upper right corner, click My account > My Bugs and select Submit a
New Bug.
The trtexec tool can be used to run on DLA with the --useDLACore=N where N is 0
or 1, and --fp16 options. To run the MNIST network on DLA using trtexec, issue:
INT8
Support has been added for user-defined INT8 scales, using the new
ITensor::setDynamicRange function. This makes it possible to define dynamic
range for INT8 tensors without the need for a calibration data set. setDynamicRange
currently supports only symmetric quantization. A user must either supply a
dynamic range for each tensor or use the calibrator interface to take advantage of
INT8 support.
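A sketch of supplying per-tensor ranges instead of calibrating; the +/-2.0f range is a placeholder that would normally come from your own quantization statistics:

#include "NvInfer.h"
using namespace nvinfer1;

// Sketch only: enable INT8 without a calibration data set by assigning every
// network input and layer output a symmetric dynamic range.
void setPerTensorRanges(INetworkDefinition* network, IBuilder* builder)
{
    for (int i = 0; i < network->getNbInputs(); ++i)
        network->getInput(i)->setDynamicRange(-2.0f, 2.0f);

    for (int i = 0; i < network->getNbLayers(); ++i)
    {
        ILayer* layer = network->getLayer(i);
        for (int j = 0; j < layer->getNbOutputs(); ++j)
            layer->getOutput(j)->setDynamicRange(-2.0f, 2.0f);
    }

    builder->setInt8Mode(true);
    builder->setInt8Calibrator(nullptr); // no calibration data set is needed
}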
Plugin Registry
A new searchable plugin registry, IPluginRegistry, is a single registration point
for all plugins in an application and is used to find plugin implementations during
deserialization.
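As a sketch, a creator can be looked up in the registry and used to instantiate a plugin at network-construction time; the plugin name, version, and empty field list below are hypothetical:

#include "NvInferPlugin.h"
using namespace nvinfer1;

// Sketch only: "MyPlugin"/"1" is a hypothetical {name, version} pair.
IPluginV2* createMyPlugin()
{
    IPluginCreator* creator = getPluginRegistry()->getPluginCreator("MyPlugin", "1");
    if (!creator)
        return nullptr;

    PluginFieldCollection fields{}; // this hypothetical plugin takes no fields
    return creator->createPlugin("my_plugin_instance", &fields);
}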
C++ Samples
sampleSSD
This sample demonstrates how to perform inference on the Caffe SSD network
in TensorRT, use TensorRT plugins to speed up inference, and perform INT8
calibration on an SSD network. To generate the required prototxt file for this
sample, perform the following steps:
1. Download models_VGGNet_VOC0712_SSD_300x300.tar.gz from
https://drive.google.com/file/d/0BzKzrI_SkD1_WVVTSmQxU0dVRzA/view
2. Extract the contents of the tar file:
tar xvf ~/Downloads/models_VGGNet_VOC0712_SSD_300x300.tar.gz
3. Edit the deploy.prototxt file and change all the Flatten layers to Reshape
operations with the following parameters:
reshape_param {
  shape {
    dim: 0
    dim: -1
    dim: 1
    dim: 1
  }
}
4. Update the detection_out layer by adding the keep_count output, for
example, add:
top: "keep_count"
5. Rename the deploy.prototxt file to ssd.prototxt and run the sample.
6. To run the sample in INT8 mode, install Pillow first by issuing the
$ pip install Pillow command, then follow the instructions from the README.
sampleINT8API
This sample demonstrates how to perform INT8 Inference using per-tensor
dynamic range. To generate the required input data files for this sample, perform
the following steps:
Running the sample:
1. Download the model files from GitHub, for example:
wget https://s3.amazonaws.com/download.onnx/models/opset_3/resnet50.tar.gz
2. Unzip the tar file:
Python Samples
yolov3_onnx
This sample demonstrates a full ONNX-based pipeline for inference with the
network YOLOv3-608, including pre- and post-processing.
uff_ssd
This sample demonstrates a full UFF-based inference pipeline for performing
inference with an SSD (InceptionV2 feature extractor) network.
IPluginV2
A plugin class IPluginV2 has been added together with a corresponding IPluginV2
layer. The IPluginV2 class includes similar methods to IPlugin and IPluginExt,
so if your plugin implemented IPluginExt previously, you will need to change the
class name to IPluginV2. The IPlugin and IPluginExt interfaces are to be deprecated
in the future, therefore, moving to the IPluginV2 interface for this release is strongly
recommended.
The specific DLA core to execute the engine on can be set by the following methods:
IBuilder::setDLACore(int dlaCore)
IRuntime::setDLACore(int dlaCore)
The following methods have been added to get the DLA core set on IBuilder or
IRuntime objects:
int IBuilder::getDLACore()
int IRuntime::getDLACore()
Another API has been added to query the number of accessible DLA cores as
follows:
int IBuilder::getNbDLACores()
int IRuntime::getNbDLACores()
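A build-time sketch that combines these calls; FP16 and GPU fallback are enabled because DLA does not support FP32 and not every layer can run on DLA, and the core index is illustrative:

#include "NvInfer.h"
using namespace nvinfer1;

// Sketch only: target DLA core 0 at build time if a DLA is present.
void configureForDLA(IBuilder* builder)
{
    if (builder->getNbDLACores() > 0)
    {
        builder->setFp16Mode(true);
        builder->setDefaultDeviceType(DeviceType::kDLA);
        builder->setDLACore(0);
        builder->allowGPUFallback(true);
    }
}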
Compatibility
‣ TensorRT 5.0.2 has been tested with cuDNN 7.3.1.
‣ This TensorRT release supports CUDA 10.0 and CUDA 9.0. CUDA 8.0 and CUDA
9.2 are no longer supported. On Windows only, CUDA 10.0 is supported for
TensorRT 5.0.1 RC.
Limitations In 5.0.2
‣ TensorRT 5.0.2 does not include support for DLA with the INT8 data type. Only
DLA with the FP16 data type is supported by TensorRT at this time. DLA with INT8
support is planned for a future TensorRT release.
‣ The ONNX parser is not supported on Windows 10. This includes all samples which
depend on the ONNX parser. ONNX support will be added in a future release.
‣ Tensor Cores supporting INT4 were first introduced with Turing GPUs. This release
of TensorRT 5.0 does not support INT4.
‣ The yolov3_onnx Python sample is not supported on Ubuntu 14.04 and earlier.
‣ The Leaky ReLU plugin (LReLU_TRT) allows for only a parameterized slope on a per
tensor basis.
Deprecated Features
The following features are deprecated in TensorRT 5.0.2:
‣ The majority of the old Python API, including the Lite and Utils API, is deprecated.
It is currently still accessible in the tensorrt.legacy package, but will be removed
in a future release.
‣ caffe_to_trt
‣ pytorch_to_trt
‣ tf_to_trt
‣ onnx_mnist
‣ uff_mnist
‣ mnist_api
‣ sample_onnx
‣ googlenet
‣ custom_layers
‣ lite_examples
‣ resnet_as_a_service
‣ The old ONNX parser will no longer be packaged with TensorRT; instead, use the
open-source ONNX parser.
‣ The plugin APIs that return INvPlugin are being deprecated and they now
return IPluginV2. These APIs will be removed in a future release. Refer to
NvInferPlugin.h inside the TensorRT package.
Known Issues
‣ Only AlexNet, GoogleNet, ResNet-50, and MNIST are known to work with DLA.
Other networks may work, but they have not been extensively tested.
‣ For this TensorRT release, there are separate JetPack L4T and Drive D5L packages
due to differences in the DLA library dependencies. In a future release, this should
become unified.
‣ The C++ API documentation is not included in the TensorRT zip file. Refer to the
online documentation if you want to view the TensorRT C++ API.
‣ Most README files that are included with the samples assume that you are
working on a Linux workstation. If you are using Windows and do not have access
to a Linux system with an NVIDIA GPU, then you can try using VirtualBox to create
a virtual machine based on Ubuntu. Many samples do not require any training,
therefore the CPU versions of TensorFlow and PyTorch are enough to complete the
samples.
‣ The TensorRT Developer Guide has been written with Linux users in mind.
Windows specific instructions, where possible, will be added in a future revision of
the document.
‣ To create a valid UFF file for sampleMovieLensMPS, the correct command is:
‣ The trtexec tool does not currently validate command-line arguments. If you
encounter failures, double check the command-line parameters that you provided.
Turing
You must use CUDA 10.0 or later if you are using a Turing GPU.
INT8
Support for user-defined INT8 scales, using the new ITensor::setDynamicRange
function. This makes it possible to provide custom INT8 calibration without the need
for a calibration data set. setDynamicRange currently supports only symmetric
quantization. Furthermore, if no calibration table is provided, calibration scales must
be provided for each layer.
Plugin Registry
A new searchable plugin registry, IPluginRegistry, is a single registration
point for all plugins in an application and is used to find plugin implementations
during deserialization.
sampleSSD
This sample demonstrates how to preprocess the input to the SSD network, perform
inference on the SSD network in TensorRT, use TensorRT plugins to speed up
inference, and perform INT8 calibration on an SSD network.
See the TensorRT Developer Guide for details.
Compatibility
‣ TensorRT 5.0.1 RC has been tested with cuDNN 7.3.0.
‣ TensorRT 5.0.1 RC for Windows has been tested with Visual Studio 2017.
‣ This TensorRT release supports CUDA 10.0 and CUDA 9.0. CUDA 8.0 and CUDA
9.2 are no longer supported. On Windows only, CUDA 10.0 is supported for
TensorRT 5.0.1 RC.
Limitations In 5.0.1 RC
‣ For this release, there are separate JetPack L4T and Drive D5L packages due to
differences in the DLA library dependencies. In a future release, this should become
unified.
‣ The choice of which DLA device to run on is currently made at build time. In GA, it
will be selectable at runtime.
‣ Python is not supported on Windows 10. This includes the graphsurgeon and UFF
Python modules.
‣ The ONNX parser is not supported on Windows 10. This includes all samples which
depend on the ONNX parser. ONNX support will be added in a future release.
Deprecated Features
The following features are deprecated in TensorRT 5.0.1 RC:
‣ The majority of the old Python API, including the Lite and Utils API, is deprecated. It is
currently still accessible in the tensorrt.legacy package, but will be removed in a
future release.
‣ caffe_to_trt
‣ pytorch_to_trt
‣ tf_to_trt
‣ onnx_mnist
‣ uff_mnist
‣ mnist_api
‣ sample_onnx
‣ googlenet
‣ custom_layers
‣ lite_examples
‣ resnet_as_a_service
‣ The old ONNX parser will no longer be packaged with TensorRT; instead, use the
open-source ONNX parser.
‣ The plugin APIs that return IPlugin are being deprecated and they now return
IPluginExt. These APIs will be removed in a future release. Refer to the
NvInferPlugin.h file inside the package.
Known Issues
‣ The Plugin Registry will only register plugins with a unique {name, version}
tuple. The API for this is likely to change in future versions to support multiple
plugins with same name and version.
‣ Only AlexNet, GoogleNet, ResNet-50, and MNIST are known to work with DLA.
Other networks may work, but they have not been extensively tested.
‣ The C++ API documentation is not included in the TensorRT zip file. Refer to the
online documentation if you want to view the TensorRT C++ API.
‣ Most README files that are included with the samples assume that you are
working on a Linux workstation. If you are using Windows and do not have access
to a Linux system with an NVIDIA GPU, then you can try using VirtualBox to create
a virtual machine based on Ubuntu. Many samples do not require any training,
therefore the CPU versions of TensorFlow and PyTorch are enough to complete the
samples.
‣ The TensorRT Developer Guide has been written with Linux users in mind.
Windows specific instructions, where possible, will be added in a future revision of
the document.
Turing
You must use CUDA 10.0 or later if you are using a Turing GPU.
The trtexec tool can be used to run on DLA with the --useDLA=N and --fp16
options. To run the AlexNet network on DLA using trtexec, issue:
INT8
Support for user-defined INT8 scales, using the new ITensor::setDynamicRange
function. This makes it possible to provide custom INT8 calibration without the need
for a calibration data set. setDynamicRange currently supports only symmetric
quantization. Furthermore, if no calibration table is provided, calibration scales must
be provided for each layer.
Plugin Registry
A new searchable plugin registry, IPluginRegistry, is a single registration
point for all plugins in an application and is used to find plugin implementations
during deserialization.
Compatibility
‣ TensorRT 5.0.0 RC has been tested with cuDNN 7.3.0.
‣ This TensorRT release supports CUDA 10.0 and CUDA 9.0. CUDA 8.0 and CUDA
9.2 are no longer supported.
Limitations In 5.0.0 RC
‣ For this release, there are separate JetPack L4T and Drive D5L packages due to
differences in the DLA library dependencies. In a future release, this should become
unified.
‣ The choice of which DLA device to run on is currently made at build time. In GA, it
will be selectable at runtime.
Deprecated Features
The following features are deprecated in TensorRT 5.0.0:
‣ The majority of the old Python API, including the Lite and Utils API, is deprecated. It is
currently still accessible in the tensorrt.legacy package, but will be removed in a
future release.
‣ caffe_to_trt
‣ pytorch_to_trt
‣ tf_to_trt
‣ onnx_mnist
‣ uff_mnist
‣ mnist_api
‣ sample_onnx
‣ googlenet
‣ custom_layers
‣ lite_examples
‣ resnet_as_a_service
‣ The old ONNX parser will no longer be packaged with TensorRT; instead, use the
open-source ONNX parser.
‣ The plugin APIs that return IPlugin are being deprecated and they now return
IPluginExt. These APIs will be removed in a future release. Refer to the
NvInferPlugin.h file inside the package.
Known Issues
‣ The Plugin Registry will only register plugins with a unique {name, version}
tuple. The API for this is likely to change in future versions to support multiple
plugins with same name and version.
‣ Only AlexNet, GoogleNet, ResNet-50, and MNIST are known to work with DLA.
Other networks may work, but they have not been extensively tested.
missing from the package in order to complete static linking. The two static
libraries that are required to complete linking are libonnx_proto.a and
libnvonnxparser_plugin.a, as well as the protobuf libraries mentioned earlier.
You will need to build these two missing static libraries from the open source ONNX
project. This issue will be resolved in a future release.
Chapter 4.
TENSORRT RELEASE 4.X.X
‣ TensorRT 4.0.1 GA has been tested with cuDNN 7.1.3 and now requires cuDNN
7.1.x.
‣ Support for ONNX 1.0 (Open Neural Network Exchange) has been implemented.
ONNX is a standard for representing deep learning models that enables models to be
transferred between frameworks. TensorRT can now parse the network definitions
in ONNX format, in addition to NVCaffe and UFF formats.
‣ The Custom Layer API now supports user-defined layers that take half precision, or
FP16, inputs and return FP16 outputs.
‣ Added support for the MatrixMultiply, Constant, Gather, Ragged SoftMax, Reduce,
RNNv2 and TopK layers (for K up to 25); a layer-creation sketch follows this list.
‣ This release has optimizations which target recommender systems like Neural
Collaborative Filtering.
‣ Many layers now support the ability to broadcast across the batch dimension.
‣ In TensorRT 3.0, INT8 had issues with rounding and striding in the Activation layer.
This may have caused INT8 accuracy to be low. Those issues have been fixed.
‣ The C++ samples and Python examples were tested with TensorFlow 1.8 and
PyTorch 0.4.0 where applicable.
‣ Added sampleUffSSD. This sample converts the TensorFlow Single Shot MultiBox
Detector (SSD) network to a UFF format and runs it on TensorRT using plugins. This
sample also demonstrates how other TensorFlow networks can be preprocessed and
converted to UFF format with support of custom plugin nodes.
‣ Applications may now provide their own memory for activations and
workspace during inference, which is used only while the pipeline is running.
‣ An allocator callback is available for all memory allocated on the GPU. In
addition, model deserialization is significantly faster (from system memory, up
to 10x faster on large models).
‣ The sample utility giexec has been renamed to trtexec to be consistent with the
product name, TensorRT, which is often shortened to TRT. A compatibility script for
users of giexec has been included to help users make the transition.
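As a sketch of one of the newly supported layer types listed above, the following adds a TopK layer through the C++ API; the K value and axis mask are illustrative:

#include "NvInfer.h"
using namespace nvinfer1;

// Sketch only: return the 5 largest values and their indices along the
// innermost axis of a CHW tensor (K may be at most 25 in this release).
void addTop5(INetworkDefinition* network, ITensor* scores)
{
    // reduceAxes is a bit mask over the non-batch dimensions; bit 2 selects
    // the third (innermost) dimension of a CHW tensor.
    ITopKLayer* topk = network->addTopK(*scores, TopKOperation::kMAX, 5, 1U << 2);
    topk->getOutput(0)->setName("top5_values");
    topk->getOutput(1)->setName("top5_indices");
}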
Deprecated Features
‣ The RNN layer type is deprecated in favor of RNNv2, however, it is still available
for backwards compatibility.
‣ Legacy GIE version defines in NvInfer.h have been removed. They were
NV_GIE_MAJOR, NV_GIE_MINOR, NV_GIE_PATCH, and NV_GIE_VERSION. The correct
alternatives are NV_TENSORRT_MAJOR, NV_TENSORRT_MINOR, NV_TENSORRT_PATCH,
and NV_TENSORRT_VERSION which existed in TensorRT 3.0.4 as well.
‣ Dimension types are now ignored in the API, however, they are still available for
backwards compatibility.
Known Issues
‣ If the ONNX parser included with TensorRT is unable to parse your model, then try
updating to the latest open source ONNX parser, which may resolve your issue.
‣ PyTorch no longer supports Python 3.4 with their current release (0.4.0). Therefore,
the TensorRT PyTorch examples will not work when using Python 3 on Ubuntu
14.04.
‣ Reshape to a tensor that has a larger number of dimensions than the input tensor is
not supported.
‣ Reformat has a known memory overwrite issue on Volta when FP16 is used with the
Concatenation layer and the Reformat layer.
‣ If you have two different CUDA versions of TensorRT installed, such as CUDA 8.0
and CUDA 9.0, or CUDA 9.2 using local repos, then you will need to execute an
additional command to install the CUDA 8.0 version of TensorRT and prevent it
from upgrading to the CUDA 9.0 or CUDA 9.2 versions of TensorRT.
‣ sampleNMT
‣ sampleUffSSD
‣ Some precision loss was observed while running the network in INT8 mode,
causing some objects to go undetected in the image. Our general observation is
that having at least 500 images for calibration is a good starting point.
‣ Performance regressions
‣ The NvUffParser returns the output tensor in the shape specified by the user, and
not in NCHW shape as in earlier versions of TensorRT. In other words, the output
tensor shape will match the shape of the tensor returned by TensorFlow, for the
same network.
‣ The Python 3.4 documentation is missing from the Ubuntu 14.04 packages. Refer
to the Python 2.7 documentation or view the online Python documentation as an
alternative.
‣ Some samples do not provide a -h argument to print the sample usage. You can
refer to the README.txt file in the sample directory for usage examples. Also, if the
data files for some samples cannot be found it will sometimes raise an exception and
abort instead of exiting normally.
‣ If you have more than one version of the CUDA toolkit installed on your system and
the CUDA version for TensorRT is not the latest version of the CUDA toolkit, then
you will need to provide an additional argument when compiling the samples. For
example, if you have CUDA 9.0 and CUDA 9.2 installed and you are using TensorRT
for CUDA 9.0:
make CUDA_INSTALL_DIR=/usr/local/cuda-9.0
‣ When you pip uninstall the tensorrtplugins Python package, you may see
the following error which can be ignored.
‣ Due to a bug in cuDNN 7.1.3, which is the version of cuDNN TensorRT has been
validated against, using RNNs with half precision on Kepler GPUs will cause
TensorRT to abort. FP16 support is non-native on Kepler GPUs, therefore, using any
precision other than FP32 is discouraged except for testing.
‣ Support for ONNX 1.0 (Open Neural Network Exchange) has been implemented.
TensorRT can now parse the network definitions in ONNX format, in addition to
NVCaffe and UFF formats.
‣ The Custom Layer API now supports user-defined layers that take half precision, or
FP16, inputs and return FP16 tensors.
‣ Added support for the MatrixMultiply, Constant, Gather, Ragged SoftMax, Reduce,
RNNv2 and TopK layers (for K up to 25).
Deprecated Features
‣ The RNN layer type is deprecated in favor of RNNv2, however, it is still available
for backwards compatibility.
‣ Legacy GIE version defines in NvInfer.h have been removed. They were
NV_GIE_MAJOR, NV_GIE_MINOR, NV_GIE_PATCH, and NV_GIE_VERSION. The correct
alternatives are NV_TENSORRT_MAJOR, NV_TENSORRT_MINOR, NV_TENSORRT_PATCH,
and NV_TENSORRT_VERSION which existed in TensorRT 3.0.4 as well.
‣ Dimension Types are now ignored in the API, however, they are still available for
backwards compatibility.
Known Issues
SampleMLP and SampleNMT are included in this release, however, they are beta
samples. They are currently not optimized for mobile platforms.
‣ Support for ONNX 1.0 (Open Neural Network Exchange) has been implemented.
TensorRT can now parse the network definitions in ONNX format, in addition to
NVCaffe and UFF formats.
‣ The Custom Layer API now supports user-defined layers that take half precision, or
FP16, inputs and return FP16 tensors.
‣ Added support for the MatrixMultiply, Constant, Gather, Ragged SoftMax, Reduce,
RNNv2 and TopK layers (for K up to 25).
‣ The samples were tested with TensorFlow 1.6. You must be using cuDNN 7.0.x in
order to use both TensorRT and TensorFlow at the same time since TensorFlow 1.6
does not support cuDNN 7.1.x yet.
Deprecated Features
‣ The RNN layer type is deprecated in favor of RNNv2, however, it is still available
for backwards compatibility.
‣ Legacy GIE version defines in NvInfer.h have been removed. They were
NV_GIE_MAJOR, NV_GIE_MINOR, NV_GIE_PATCH, and NV_GIE_VERSION. The correct
alternatives are NV_TENSORRT_MAJOR, NV_TENSORRT_MINOR, NV_TENSORRT_PATCH,
and NV_TENSORRT_VERSION which existed in TensorRT 3.0.4 as well.
‣ Dimension Types are now ignored in the API, however, they are still available for
backwards compatibility.
Known Issues
‣ If you were previously using the machine learning debian repository, then it will
conflict with the version of libcudnn7 that is contained within the local repository
for TensorRT. The following commands will downgrade libcudnn7 to version
7.0.5.15, which is supported and tested with TensorRT, and hold the package at
this version. If you are using CUDA 8.0 for your application, ensure you replace
cuda9.0 with cuda8.0.
If you would like to later upgrade libcudnn7 to the latest version, then you can use
the following commands to remove the hold.
‣ If you have both the CUDA 8.0 and CUDA 9.0 local repos installed for TensorRT,
then you will need to execute an additional command to install the CUDA 8.0
version of TensorRT and prevent it from upgrading to the CUDA 9.0 version of
TensorRT.
‣ SampleMLP
‣ Performance regressions
Chapter 5.
TENSORRT RELEASE 3.X.X
‣ Fixed an issue with INT8 deconvolution bias. If you have seen an issue with
deconvolution INT8 accuracy, especially regarding TensorRT 2.1, then this fix
should solve the issue.
‣ Fixed an accuracy issue in FP16 mode for NVCaffe models.
‣ The UFF converter script is packaged only for x86 users. If you are not an x86 user,
and you want to convert TensorFlow models into UFF, you need to obtain the
conversion script from the x86 package of TensorRT.
Known Issues
‣ When building the samples on aarch64 natively, there is an issue in the
Makefile.config file that requires you to provide an additional option to make,
namely CUDA_LIBDIR.
‣ The infer_caffe_static test fails on D5L Parker dGPU. This is a regression from
the previous release.
‣ QNX has known performance issues with the mmap and malloc() operating system
memory allocation routines. These issues can affect the performance of TensorRT by
up to 10x.
‣ Fixed a bug in one of the INT8 deconvolution kernels that was generating
incorrect results. This fixed accuracy regression from 2.1 for networks that use
deconvolutions.
‣ Fixed a bug where the builder would report out-of-memory when compiling a low
precision network, in the case that a low-precision version of the kernel could not
be found. The builder now correctly falls back to a higher precision version of the
kernel.
‣ Fixed a bug where the existence of some low-precision kernels were being
incorrectly reported to the builder.
‣ When working with large networks and large batch sizes on the Jetson TX1 you
may see failures that are the result of CUDA error 4. This error generally means a
CUDA kernel failed to execute properly, but sometimes this can mean the CUDA
kernel actually timed out. The CPU and GPU share memory on the Jetson TX1 and
reducing the memory used by the CPU would help the situation. If you are not
using the graphical display on L4T you can stop the X11 server to free up CPU and
GPU memory. This can be done using:
Known Issues
‣ INT8 deconvolutions with biases have the bias scaled incorrectly. U-Net based
segmentation networks typically have non-zero bias.
‣ For TensorRT Android 32-bit, if your memory usage is high, then you may see
TensorRT failures. The issue is related to the CUDA allocated buffer address being
higher or equal to 0x80000000 and it is hard to know the exact memory usage after
which this issue is hit.
‣ If you are installing TensorRT from a tar package (instead of using the .deb
packages and apt-get), you will need to update the custom_plugins example
to point to the location that the tar package was installed into. For example,
in the <PYTHON_INSTALL_PATH>/tensorrt/examples/custom_layers/
tensorrtplugins/setup.py file change the following:
If you would like to later upgrade libcudnn7 to the latest version, then you can use
the following commands to remove the hold.
‣ The TensorRT deconvolution layer previously did not support non-zero padding,
or stride values that were distinct from kernel size. These restrictions have now
been lifted.
‣ The TensorRT deconvolution layer now supports groups.
‣ Non-determinism in the deconvolution layer implementation has been
eliminated.
‣ The TensorRT convolution layer API now supports dilated convolutions.
‣ The TensorRT API now supports these new layers (but they are not supported via
the NvCaffeParser):
‣ unary
‣ shuffle
‣ padding
‣ The Elementwise (eltwise) layer now supports broadcasting of input dimensions.
‣ The Flatten layer flattens the input while maintaining the batch_size. This layer
was added in the UFF converter and NvUffParser.
‣ The Squeeze layer removes dimensions of size 1 from the shape of a tensor. This
layer was added in the UFF converter and NvUffParser.
Performance
‣ Performance regressions seen from v2.1 to 3.0.1 Release Candidate for INT8 and
FP16 are now fixed.
‣ The INT8 regression in LRN that impacted networks like GoogleNet and
AlexNet is now fixed.
‣ The FP16 regression that impacted networks like AlexNet and ResNet-50 is
now fixed.
‣ The performance of the Xception network has improved, for example, by more
than 3 times when batch size is 8 on Tesla P4.
‣ Changed how the CPU synchronizes with the GPU in order to reduce the overall
load on the CPU when running inference with TensorRT.
‣ The deconvolution layer implementation included with TensorRT was, in some
circumstances, using significantly more memory and had lower performance
than the implementation provided by the cuDNN library. This has now been
fixed.
‣ MAX_TENSOR_SIZE changed from (1<<30) to ((1<<31)-1). This change
enables the user to run larger batch sizes for networks with large input images.
Samples
‣ All Python examples now import TensorRT after the appropriate framework is
imported. For example, the tf_to_trt.py example imports TensorFlow before
importing TensorRT. This is done to avoid cuDNN version conflict issues.
‣ The tf_to_trt and pytorch_to_trt samples shipped with the TensorRT 3.0
Release Candidate included network models that were improperly trained with
the MNIST dataset, resulting in poor classification accuracy. This version has new
models that have been properly trained with the MNIST dataset to provide better
classification accuracy.
‣ The pytorch_to_trt sample originally showed low accuracy with MNIST,
however, data and training parameters were modified to address this.
‣ The giexec command line wrapper in earlier versions would fail if users specified a
workspace >= 2048 MB. This issue is now fixed.
Functionality
The AverageCountExcludesPadding attribute has been added to the pooling
layer to control whether to use inclusive or exclusive averaging. The default is
true, as used by most frameworks. The NvCaffeParser sets this to false, restoring
compatibility of padded average pooling between NVCaffe and TensorRT.
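A sketch of overriding the attribute on a padded average-pooling layer; the window and padding sizes are illustrative:

#include "NvInfer.h"
using namespace nvinfer1;

// Sketch only: opt back into NVCaffe-style inclusive averaging, overriding the
// new default of true (exclusive averaging).
void addCaffeStylePooling(INetworkDefinition* network, ITensor* input)
{
    IPoolingLayer* pool =
        network->addPooling(*input, PoolingType::kAVERAGE, DimsHW{3, 3});
    pool->setPadding(DimsHW{1, 1});
    pool->setAverageCountExcludesPadding(false); // count padded elements in the average
}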
‣ The NvCaffeParser
‣ The NvUffParser
‣ The nvinfer graph definition API
‣ The inference engine builder
‣ TensorFlow 1.3
‣ FP32 CNNs
‣ FP16 CNNs
The TensorFlow export path is currently not expected to support the following:
Volta
The NVIDIA Volta architecture is now supported, including the Tesla V100 GPU. On
Volta devices, the Tensor Core feature provides a large performance improvement,
and Tensor Cores are automatically used when the builder is set to half2mode.
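Enabling this at build time is a single call on the builder, sketched below; the builder object is assumed to already exist:

#include "NvInfer.h"

// Sketch only: request FP16 "half2" kernels; on Volta the builder then picks
// Tensor Core implementations automatically when they are fastest.
void enableHalf2(nvinfer1::IBuilder* builder)
{
    if (builder->platformHasFastFp16())
        builder->setHalf2Mode(true);
}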
QNX
TensorRT 3.0.1 runs on the QNX operating system on the Drive PX2 platform.
‣ There are various details in the Release Notes and Developer Guide about the
pytorch_to_trt Python example. This sample is no longer part of the package
because of cuDNN symbol conflict issues between PyTorch and TensorRT.
‣ In the Installation and Setup section of the Release Notes, it is mentioned that
TENSORRT_LIB_DIR should point to <TAR_INSTALL_ROOT>/lib64. Instead,
TENSORRT_LIB_DIR should point to <TAR_INSTALL_ROOT>/lib.
‣ There are some known minor performance regressions for FP32 mode on K80 for
large batch sizes on CUDA 8. Update to CUDA 9 if you see a similar performance
regression.
‣ Although networks can use NHWC and NCHW, TensorFlow users are encouraged
to convert their networks to use NCHW data ordering explicitly in order to achieve
the best possible performance.
‣ The libnvcaffe_parsers.so library file is now called libnvparsers.so. The
links for libnvcaffe_parsers are updated to point to the new libnvparsers
library. The static library libnvcaffe_parser.a is also linked to the new
libnvparsers.
Known Issues
Installation and Setup
‣ If you are installing TensorRT from a tar package (instead of using the .deb
packages and apt-get), you will need to update the custom_plugins example
to point to the location that the tar package was installed into. For example,
in the <PYTHON_INSTALL_PATH>/tensorrt/examples/custom_layers/
tensorrtplugins/setup.py file change the following:
‣ The TensorRT Python APIs are only supported on x86-based systems. Some
installation packages for ARM-based systems may contain Python .whl files. Do
not install these on ARM systems, as they will not function.
‣ The TensorRT product version is incremented from 2.1 to 3.0.1 because we added
major new functionality to the product. The libnvinfer package version number
was incremented from 3.0.2 to 4.0 because we made non-backward compatible
changes to the application programming interface.
‣ The TensorRT debian package name was simplified in this release to tensorrt.
In previous releases, the product version was used as a suffix, for example
tensorrt-2.1.2.
‣ If you have trouble installing the TensorRT Python modules on Ubuntu 14.04, refer
to the steps on installing swig to resolve the issue. For installation instructions, see
Unix Installation.
‣ The Flatten layer can only be placed in front of the Fully Connected layer. This
means that the Flatten layer can only be used if its output is directly fed to a Fully
Connected layer.
‣ The Squeeze layer only implements the binary squeeze (removing specific size 1
dimensions). The batch dimension cannot be removed.
‣ If you see the Numpy.core.multiarray failed to import error message,
upgrade your NumPy to version 1.13.0 or greater.
‣ For Ubuntu 14.04, use pip version >= 9.0.1 to get all the dependencies installed.
‣ The TensorFlow to TensorRT model export works only when running TensorFlow
with GPU support enabled. The converter does not work if TensorFlow is running
without GPU acceleration.
‣ The TensorFlow to TensorRT model export does not work with network models
specified using the TensorFlow Slim interface, nor does it work with models
specified using the Keras interface.
‣ The TensorFlow to TensorRT model export does not support recurrent neural
network (RNN) models.
‣ The TensorFlow to TensorRT model export may produce a model that has extra
tensor reformatting layers compared to a model generated directly using the C++ or
Python TensorRT graph builder API. This may cause the model that originated from
TensorFlow to run slower than the model constructed directly with the TensorRT
APIs.
‣ Although TensorFlow models can use either NHWC or NCHW tensor layouts,
TensorFlow users are encouraged to convert their models to use the NCHW tensor
layout explicitly, in order to achieve the best possible performance when
exporting the model to TensorRT (a minimal layout-conversion sketch follows
this list).
‣ The TensorFlow parser requires that input will be fed to the network in NCHW
format.
‣ On the V100 GPU, running models with INT8 only works if the batch size is evenly
divisible by 4.
‣ The TensorRT Python interface requires NumPy 1.13.0, while installing
TensorRT using pip may install only NumPy 1.11.0. Run sudo pip install numpy
-U to update if the NumPy version on your machine is not 1.13.0 or greater.
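The layout conversion mentioned above can also be done outside of TensorFlow;
the following NumPy-only sketch (shapes are illustrative) shows the NHWC to
NCHW permutation:

    import numpy as np

    # NHWC (TensorFlow default) -> NCHW (preferred by TensorRT).
    nhwc = np.random.rand(8, 224, 224, 3).astype(np.float32)   # N, H, W, C
    nchw = np.transpose(nhwc, (0, 3, 1, 2))                    # N, C, H, W
    assert nchw.shape == (8, 3, 224, 224)
    # In TensorFlow itself the equivalent is tf.transpose(x, [0, 3, 1, 2]).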
The TensorFlow export path is expected to support the following:
‣ TensorFlow 1.3
‣ FP32 CNNs
‣ FP16 CNNs
The TensorFlow export path is currently not expected to support the following:
Python API
TensorRT 3.0 introduces the TensorRT Python API, which provides developers
interfaces to:
‣ the NvCaffeParser
‣ the NvUffParser
‣ the nvinfer graph definition API
‣ the inference engine builder
‣ the engine executor
TensorRT also introduces a workflow to include C++ custom layer implementations in
Python based TensorRT applications.
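A rough sketch of the kind of workflow these interfaces enable is shown below.
It uses the current tensorrt binding names rather than the 3.0 module layout,
and the file names and the "prob" output blob name are placeholders.

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network()
    parser = trt.CaffeParser()

    # Parse a Caffe model (paths and the "prob" output name are hypothetical).
    tensors = parser.parse(deploy="model.prototxt", model="model.caffemodel",
                           network=network, dtype=trt.float32)
    network.mark_output(tensors.find("prob"))

    # Build the engine and create an execution context for inference.
    engine = builder.build_cuda_engine(network)
    context = engine.create_execution_context()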
‣ The TensorRT deconvolution layer previously did not support non-zero padding,
or stride values that were distinct from kernel size. These restrictions have now
been lifted.
‣ The TensorRT deconvolution layer now supports groups.
‣ Non-determinism in the deconvolution layer implementation has been
eliminated.
‣ The TensorRT convolution layer API now supports dilated convolutions (a
short sketch follows this list).
‣ The TensorRT API now supports these new layers (but they are not supported via
the NvCaffeParser):
‣ unary
‣ shuffle
‣ padding
‣ The Elementwise (eltwise) layer now supports broadcasting of input dimensions.
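As a sketch of the dilated-convolution support, the following uses the current
tensorrt Python bindings; the weight values, tensor shapes, and dilation
factor are placeholders rather than anything prescribed by this release.

    import numpy as np
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network()
    data = network.add_input("data", trt.float32, (16, 32, 32))

    # A 3x3 convolution with zero-initialized placeholder weights.
    kernel = np.zeros((32, 16, 3, 3), dtype=np.float32)
    bias = np.zeros((32,), dtype=np.float32)
    conv = network.add_convolution(data, 32, (3, 3), kernel, bias)
    conv.dilation = (2, 2)   # the dilation support described above
    network.mark_output(conv.get_output(0))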
QNX
TensorRT 3.0 runs on the QNX operating system on the Drive PX2 platform.
Known Issues
Installation and Setup
‣ If you are installing TensorRT from a tar package (instead of using the .deb
packages and apt-get), then the custom_plugins example will need to be
updated to point to the location that the tar package was installed to. For example,
in the <PYTHON_INSTALL_PATH>/tensorrt/examples/custom_layers/
tensorrtplugins/setup.py file change the following:
‣ The TensorFlow to TensorRT model export works only when running TensorFlow
with GPU support enabled. The converter does not work if TensorFlow is running
without GPU acceleration.
‣ The TensorFlow to TensorRT model export does not work with network models
specified using the TensorFlow Slim interface, nor does it work with models
specified using the Keras interface.
‣ The TensorFlow to TensorRT model export does not support recurrent neural
network (RNN) models.
‣ The TensorFlow to TensorRT model export does not support convolutional layers
that have asymmetric padding (a different number of zero-padded rows and
columns).
‣ The TensorFlow to TensorRT model export may produce a model that has extra
tensor reformatting layers compared to a model generated directly using the C++ or
Python TensorRT graph builder API. This may cause the model that originated from
TensorFlow to run slower than the model constructed directly with the TensorRT
APIs.
‣ Although TensorFlow models can use either NHWC or NCHW tensor layouts,
TensorFlow users are encouraged to convert their models to use the NCHW tensor
layout explicitly, in order to achieve the best possible performance.
Other known issues
‣ The Inception v4 network models are not supported with this Release Candidate
with FP16 on V100.
‣ On V100, running models with INT8 does not work if the batch size is not
divisible by 4.
‣ The Average Pooling behavior has changed to exclude padding from the
computation, which is how all other Pooling modes handle padding. This results in
incorrect behavior for network models which rely on Average Pooling and which
include padding, such as Inception v3. This issue will be addressed in a future
release.
‣ In this Release Candidate, the arguments for the tensorrt_exec.py script are
slightly different than the ones for the giexec executable, and can be a source
of confusion for users. Consult the documentation carefully to avoid unexpected
errors. The command-line arguments will be changed to match giexec in a future
update.
‣ The INT8 Calibration feature is not available in the TensorRT Python APIs.
‣ The examples/custom_layer sample will not work on Ubuntu 14.04 x86_64
systems; however, it does work properly on Ubuntu 16.04 systems. This will be
fixed in the next update of the software.
The TensorFlow export path is expected to support the following:
‣ TensorFlow 1.0
‣ FP32 CNNs
‣ FP16 CNNs
The TensorFlow export path is currently not expected to support the following:
The TensorRT Python API provides developers interfaces to:
‣ the NvCaffeParser
‣ the NvUffParser
‣ the nvinfer graph definition API
‣ the inference engine builder
‣ the inference-time interface for engine execution within Python
TensorRT also introduces a workflow to include C++ custom layer implementations in
Python based TensorRT applications.
‣ Although networks can use NHWC and NCHW, TensorFlow users are encouraged
to convert their networks to use NCHW data ordering explicitly in order to achieve
the best possible performance.
‣ Average pooling behavior has changed to exclude padding from the
computation; padding is now excluded from the computation in all of the
pooling modes. This results in incorrect behavior for networks that rely on
average pooling that includes padding, such as Inception v3. This issue will
be addressed in a future release.
‣ The libnvcaffe_parsers.so library file is now called libnvparsers.so. The
links for libnvcaffe_parsers are updated to point to the new libnvparsers
library. The static library libnvcaffe_parser.a is also linked to the new
libnvparsers. For example:
Known Issues
‣ TensorRT does not support asymmetric padding.
‣ Some TensorRT optimizations have been disabled for this Early Access (EA)
release to ensure that the UFF model runs properly. This will be addressed in
TensorRT 3.0.
‣ The TensorFlow conversion path is not fully optimized.
‣ INT8 Calibration is not available in Python.
‣ Deconvolution is not implemented in the UFF workflow.
Chapter 6.
TENSORRT RELEASE 2.X.X
Installers
There are two ways to install TensorRT 2.1:
1. Ubuntu deb packages. If you have root access and prefer to use package
management to ensure consistency of dependencies, then you can use the apt-
get command and the deb packages.
2. Tar file based installers. If you do not have root access or you want to install
multiple versions of TensorRT side-by-side for comparison purposes, then you
can use the tar file install. The tar file installation uses target dep-style directory
structures so that you can install TensorRT libraries for multiple architectures and
then do cross compilation.
INT8 support
TensorRT can be used on supported GPUs (such as P4 and P40) to execute networks
using INT8 rather than FP32 precision. Networks using INT8 deliver significant
performance improvements.
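As an illustration only (not part of the 2.1 release, which predated the
builder-config API), requesting INT8 precision with the current tensorrt
Python bindings looks roughly like this; a calibrator, not shown, is required
in practice:

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network()
    config = builder.create_builder_config()

    if builder.platform_has_fast_int8:        # e.g. Tesla P4 / P40
        config.set_flag(trt.BuilderFlag.INT8)
        # config.int8_calibrator = my_calibrator  # hypothetical calibrator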
‣ Running networks in FP16 or INT8 may not work correctly on platforms without
hardware support for the appropriate reduced precision instructions.
‣ GTX 750 and K1200 users will need to upgrade to CUDA 8 in order to use TensorRT.
‣ If you have previously installed TensorRT 2.0 EA or TensorRT 2.1 RC and you install
TensorRT 2.1, you may find that the old meta package is still installed. It can be
safely removed with the apt-get command.
‣ Debian packages are supplied in the form of local repositories. Once you have
installed TensorRT, you can safely remove the TensorRT local repository debian
package.
‣ The implementation of deconvolution is now deterministic. In order to
ensure determinism, the new algorithm requires more workspace (a workspace
sketch follows this list).
‣ FP16 performance was significantly improved for batch size = 1. The new algorithm
is sometimes slower for batch sizes greater than one.
‣ Calibration for INT8 does not require labeled data. SampleINT8 uses labels only to
compare the accuracy of INT8 inference with the accuracy of FP32 inference.
‣ Running with larger batch sizes gives higher overall throughput but uses more
memory. When trying TensorRT out on GPUs with smaller memory, be aware that
some of the samples may not work with batch sizes of 128.
‣ The included Caffe parser library does not currently understand the NVIDIA/Caffe
format for batch normalization. The BVLC/Caffe batch normalization format is
parsed correctly.
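The extra workspace mentioned for the deterministic deconvolution algorithm is
granted through the builder; a minimal sketch using the pre-8.x tensorrt
Python bindings follows (the 1 GiB figure is illustrative, not a documented
requirement):

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    # Scratch memory the builder may use when choosing algorithms, in bytes.
    builder.max_workspace_size = 1 << 30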
Deprecated Features
The parameterized calibration technique introduced in the 2.0 EA pre-release has been
replaced by the new entropy calibration mechanism.
Known Issues
‣ When using reduced precision, either INT8 or FP16, on platforms with
hardware support for those types, pooling with window sizes other than 1, 2,
3, 5, or 7 will fail.
‣ When using MAX_AVERAGE_BLEND or AVERAGE pooling in INT8 with a channel
count that is not a multiple of 4, TensorRT may generate incorrect results.
‣ When downloading the Faster R-CNN data on Jetson TX1, users may see the
following error:
Notice
THE INFORMATION IN THIS GUIDE AND ALL OTHER INFORMATION CONTAINED IN NVIDIA DOCUMENTATION
REFERENCED IN THIS GUIDE IS PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED,
STATUTORY, OR OTHERWISE WITH RESPECT TO THE INFORMATION FOR THE PRODUCT, AND EXPRESSLY
DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A
PARTICULAR PURPOSE. Notwithstanding any damages that customer might incur for any reason whatsoever,
NVIDIA’s aggregate and cumulative liability towards customer for the product described in this guide shall
be limited in accordance with the NVIDIA terms and conditions of sale for the product.
THE NVIDIA PRODUCT DESCRIBED IN THIS GUIDE IS NOT FAULT TOLERANT AND IS NOT DESIGNED,
MANUFACTURED OR INTENDED FOR USE IN CONNECTION WITH THE DESIGN, CONSTRUCTION, MAINTENANCE,
AND/OR OPERATION OF ANY SYSTEM WHERE THE USE OR A FAILURE OF SUCH SYSTEM COULD RESULT IN A
SITUATION THAT THREATENS THE SAFETY OF HUMAN LIFE OR SEVERE PHYSICAL HARM OR PROPERTY DAMAGE
(INCLUDING, FOR EXAMPLE, USE IN CONNECTION WITH ANY NUCLEAR, AVIONICS, LIFE SUPPORT OR OTHER
LIFE CRITICAL APPLICATION). NVIDIA EXPRESSLY DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY OF FITNESS
FOR SUCH HIGH RISK USES. NVIDIA SHALL NOT BE LIABLE TO CUSTOMER OR ANY THIRD PARTY, IN WHOLE OR
IN PART, FOR ANY CLAIMS OR DAMAGES ARISING FROM SUCH HIGH RISK USES.
NVIDIA makes no representation or warranty that the product described in this guide will be suitable for
any specified use without further testing or modification. Testing of all parameters of each product is not
necessarily performed by NVIDIA. It is customer’s sole responsibility to ensure the product is suitable and
fit for the application planned by customer and to do the necessary testing for the application in order
to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect
the quality and reliability of the NVIDIA product and may result in additional or different conditions and/
or requirements beyond those contained in this guide. NVIDIA does not accept any liability related to any
default, damage, costs or problem which may be based on or attributable to: (i) the use of the NVIDIA
product in any manner that is contrary to this guide, or (ii) customer product designs.
Other than the right for customer to use the information in this guide with the product, no other license,
either expressed or implied, is hereby granted by NVIDIA under this guide. Reproduction of information
in this guide is permissible only if reproduction is approved by NVIDIA in writing, is reproduced without
alteration, and is accompanied by all associated conditions, limitations, and notices.
Trademarks
NVIDIA, the NVIDIA logo, and cuBLAS, CUDA, cuDNN, cuFFT, cuSPARSE, DALI, DIGITS, DGX, DGX-1, Jetson,
Kepler, NVIDIA Maxwell, NCCL, NVLink, Pascal, Tegra, TensorRT, and Tesla are trademarks and/or registered
trademarks of NVIDIA Corporation in the United States and other countries. Other company and product
names may be trademarks of the respective companies with which they are associated.
Copyright
© 2019 NVIDIA Corporation. All rights reserved.