8 Bit Inference With TensorRT
● INT8 compute
● Quantization
● Calibration
● Workflow in TensorRT
● Results
INT8 Inference
Challenge
● INT8 has significantly lower precision and dynamic range compared to FP32.
Context
● Performance.
● No accuracy loss.
● Hence the solution has to be “simple” and compute-efficient.
Linear quantization
Representation:
A = scale_A * QA + bias_A
B = scale_B * QB + bias_B
Do we really need bias?
Two matrices:
A = scale_A * QA + bias_A
B = scale_B * QB + bias_B
A * B = scale_A * scale_B * QA * QB +
scale_A * QA * bias_B +
scale_B * QB * bias_A +
bias_A * bias_B
Do we really need bias? No!
Two matrices:
A = scale_A * QA
B = scale_B * QB
A * B = scale_A * scale_B * QA * QB
Symmetric linear quantization
Representation: Tensor value = FP32 scale factor * INT8 array (no bias).
[Figure: FP32 values mapped linearly onto the INT8 range -127 … 0 … 127]
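As a concrete illustration, a minimal NumPy sketch of symmetric quantization of two matrices and the rescaled INT8 product; the matrix sizes and the max-value scale factors are illustrative choices, not something prescribed by the slides:

import numpy as np

def quantize(x, scale):
    # symmetric linear quantization: x ≈ scale * q, with q an INT8 array in [-127, 127]
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 64)).astype(np.float32)
B = rng.standard_normal((64, 64)).astype(np.float32)

scale_A = np.abs(A).max() / 127.0        # simple max-value scale factor (no saturation)
scale_B = np.abs(B).max() / 127.0
QA, QB = quantize(A, scale_A), quantize(B, scale_B)

# INT8 matrices multiplied with an INT32 accumulator, then rescaled back to FP32:
approx = scale_A * scale_B * (QA.astype(np.int32) @ QB.astype(np.int32))
print(np.abs(approx - A @ B).max())      # quantization error is small relative to A @ B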
Quantization
[Figure: two ways to map FP32 values onto the INT8 range -127 … 0 … 127: map ±|max| to ±127 (no saturation), or saturate values beyond a threshold ±|T| to ±127]
● INT8 model encodes the same information as the original FP32 model.
● We want to minimize loss of information.
● Loss of information is measured by Kullback-Leibler divergence (AKA
relative entropy or information divergence).
○ P, Q - two discrete probability distributions.
○ KL_divergence(P, Q) := SUM(P[i] * log(P[i] / Q[i]), i)
● Intuition: KL divergence measures the amount of information lost when
approximating a given encoding.
Solution: Calibration
● Run FP32 inference on Calibration Dataset.
● For each Layer:
○ collect histograms of activations.
○ generate many quantized distributions
with different saturation thresholds.
○ pick the threshold which minimizes
KL_divergence(ref_distr, quant_distr); a sketch of this search follows after this list.
● Entire process takes a few minutes on a
typical desktop workstation.
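A simplified NumPy sketch of this per-layer search, assuming 2048 histogram bins and 128 quantization levels (illustrative numbers); this sketches the idea described above, not the actual TensorRT implementation:

import numpy as np

def kl_divergence(p, q):
    p, q = p / p.sum(), q / q.sum()                 # normalize both distributions
    mask = p > 0                                    # empty reference bins contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def quantize_distribution(p, num_levels):
    # merge p into num_levels groups of bins, then expand back, preserving empty bins
    q = np.zeros_like(p, dtype=np.float64)
    for idx in np.array_split(np.arange(len(p)), num_levels):
        chunk = p[idx]
        nonzero = chunk > 0
        if nonzero.any():
            q[idx[nonzero]] = chunk.sum() / nonzero.sum()
    return q

def find_threshold(hist, bin_edges, num_levels=128):
    # scan candidate saturation thresholds, keep the one minimizing KL(ref || quantized)
    best_kl, best_t = np.inf, bin_edges[-1]
    for i in range(num_levels, len(hist) + 1):
        ref = hist[:i].astype(np.float64)
        ref[-1] += hist[i:].sum()                   # outliers above the threshold saturate into the last kept bin
        kl = kl_divergence(ref, quantize_distribution(ref, num_levels))
        if kl < best_kl:
            best_kl, best_t = kl, bin_edges[i]
    return best_t

# Stand-in for activations of one layer, collected during FP32 inference on the calibration set:
acts = np.abs(np.random.default_rng(0).standard_normal(100_000) * 3.0)
hist, bin_edges = np.histogram(acts, bins=2048)
T = find_threshold(hist, bin_edges)
print(T, T / 127.0)                                 # threshold and the resulting FP32-per-INT8 scale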
Calibration Dataset
● Representative.
● Diverse.
● 1000s of samples
Results from Calibration
[Figures #1-#5: example activation distributions from calibration, shown before and after saturation at the chosen threshold]
Workflow in TensorRT
Typical workflow in TensorRT
● You will need:
○ Model trained in FP32.
○ Calibration dataset.
● TensorRT will:
○ Run inference in FP32 on calibration dataset.
○ Collect required statistics.
○ Run calibration algorithm → optimal scaling factors.
○ Quantize FP32 weights → INT8.
○ Generate “CalibrationTable” and INT8 execution engine.
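For reference, a hedged sketch of how a calibrator can be hooked up with the newer TensorRT Python API; the class, method, and flag names are from recent TensorRT releases, the data handling is a placeholder, and the TensorRT 2.1 interface this talk refers to differs:

import numpy as np
import pycuda.autoinit                   # creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    """Feeds calibration batches to TensorRT and caches the resulting CalibrationTable."""

    def __init__(self, batches, batch_size=25, cache_file="calibration.cache"):
        super().__init__()
        self.batches = iter(batches)     # iterable of float32 NumPy arrays, shape (batch_size, C, H, W)
        self.batch_size = batch_size
        self.cache_file = cache_file
        self.device_buf = None

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        try:
            batch = np.ascontiguousarray(next(self.batches), dtype=np.float32)
        except StopIteration:
            return None                  # no more data: statistics collection is finished
        if self.device_buf is None:
            self.device_buf = cuda.mem_alloc(batch.nbytes)
        cuda.memcpy_htod(self.device_buf, batch)
        return [int(self.device_buf)]    # one device pointer per network input

    def read_calibration_cache(self):
        try:
            with open(self.cache_file, "rb") as f:
                return f.read()          # reuse an existing CalibrationTable if present
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)

# When building the engine (builder and network creation omitted):
#   config = builder.create_builder_config()
#   config.set_flag(trt.BuilderFlag.INT8)
#   config.int8_calibrator = EntropyCalibrator(calibration_batches)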
Results
Results - Accuracy
                  FP32               INT8 (calib. 5 batches)   INT8 (calib. 10 batches)   INT8 (calib. 50 batches)
NETWORK           Top1     Top5      Top1     Top5              Top1     Top5               Top1     Top5
Resnet-50         73.23%   91.18%    73.03%   91.15%            73.02%   91.06%             73.10%   91.06%
Resnet-101        74.39%   91.78%    74.52%   91.64%            74.38%   91.70%             74.40%   91.73%
Resnet-152        74.78%   91.82%    74.62%   91.82%            74.66%   91.82%             74.70%   91.78%
VGG-19            68.41%   88.78%    68.42%   88.69%            68.42%   88.67%             68.38%   88.70%
Googlenet         68.57%   88.83%    68.21%   88.67%            68.10%   88.58%             68.12%   88.64%
Alexnet           57.08%   80.06%    57.00%   79.98%            57.00%   79.98%             57.05%   80.06%

Difference vs. the FP32 baseline (FP32 minus INT8):
                  FP32               Diff (5 batches)          Diff (10 batches)           Diff (50 batches)
NETWORK           Top1     Top5      Top1     Top5              Top1     Top5               Top1     Top5
Resnet-50         73.23%   91.18%    0.20%    0.03%             0.22%    0.13%              0.13%    0.12%
Resnet-101        74.39%   91.78%    -0.13%   0.14%             0.01%    0.09%              -0.01%   0.06%
Resnet-152        74.78%   91.82%    0.15%    0.01%             0.11%    0.01%              0.08%    0.05%
VGG-19            68.41%   88.78%    -0.02%   0.09%             -0.01%   0.10%              0.03%    0.07%
Googlenet         68.57%   88.83%    0.36%    0.16%             0.46%    0.25%              0.45%    0.19%
Alexnet           57.08%   80.06%    0.08%    0.08%             0.08%    0.07%              0.03%    -0.01%
TensorRT 2.1, all optimizations enabled. ILSVRC2012 validation dataset, batch = 25 images.
Accuracy was measured on 500 batches which were not used for the calibration.
Example: quantizing a distribution and computing KL divergence
● Quantize the original FP32 data such that the loss of information is minimized.
Here is a simple example: a reference distribution P consisting of 8 bins, which we want to quantize into 2 bins:
P = [ 1, 0, 2, 3, 5, 3, 1, 7]
we merge into 2 bins (8 / 2 = 4 consecutive bins are merged into one bin)
[1 + 0 + 2 + 3 , 5 + 3 + 1 + 7] = [6, 16]
then proportionally expand back to 8 bins, preserving the empty bins of the original distribution P:
Q = [ 6/3, 0, 6/3, 6/3, 16/4, 16/4, 16/4, 16/4] = [ 2, 0, 2, 2, 4, 4, 4, 4]
Now normalize both distributions; after that we can compute the KL divergence:
P /= sum(P) Q /= sum(Q)
result = KL_divergence(P, Q)
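The same computation in a few lines of NumPy (values taken directly from the example above):

import numpy as np

P = np.array([1, 0, 2, 3, 5, 3, 1, 7], dtype=np.float64)
Q = np.array([2, 0, 2, 2, 4, 4, 4, 4], dtype=np.float64)   # expanded candidate distribution from above
p, q = P / P.sum(), Q / Q.sum()                            # normalize both distributions
mask = p > 0                                               # empty bins of P contribute nothing
print(np.sum(p[mask] * np.log(p[mask] / q[mask])))         # KL_divergence(P, Q) ≈ 0.15 (natural log)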
Pseudocode for the INT8 conv kernel
// I8 input tensors: I8_input, I8_weights; I8 output tensor: I8_output
// F32 bias (original bias from the F32 model)
// F32 scaling factors: input_scale, output_scale, weights_scale[K]; each scale maps F32 to I8 (I8_value ≈ scale * F32_value)
I32_gemm_out = I8_input * I8_weights                      // INT8 GEMM with INT32 accumulator
// Rescale each output channel k so that the result is scaled with "output_scale" (multiplication done in F32):
rescaled_F32_gemm_out[:, k, :, :] = (float)I32_gemm_out[:, k, :, :] * output_scale / (input_scale * weights_scale[k])
// Add bias; to perform the addition we have to rescale the original F32 bias so that it is scaled with "output_scale"
rescaled_F32_gemm_out_with_bias = rescaled_F32_gemm_out + output_scale * bias
I8_output = Saturate(Round(ReLU(rescaled_F32_gemm_out_with_bias)))   // ReLU in F32, then round and saturate to INT8
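A small NumPy check of this rescaling on random data; the shapes and scale choices are illustrative, and the bias/ReLU steps are omitted for brevity:

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8)).astype(np.float32)        # "activations": 16 pixels x 8 input channels
w = rng.standard_normal((8, 4)).astype(np.float32)         # "weights": 8 input channels x 4 output channels (K = 4)

input_scale   = 127.0 / np.abs(x).max()                    # scales map F32 -> I8, as in the pseudocode
weights_scale = 127.0 / np.abs(w).max(axis=0)              # one scale per output channel
i8_x = np.clip(np.round(x * input_scale), -127, 127).astype(np.int8)
i8_w = np.clip(np.round(w * weights_scale), -127, 127).astype(np.int8)

ref = x @ w                                                # FP32 reference result
output_scale = 127.0 / np.abs(ref).max()

i32_gemm = i8_x.astype(np.int32) @ i8_w.astype(np.int32)   # INT8 GEMM with INT32 accumulator
rescaled = i32_gemm * (output_scale / (input_scale * weights_scale))
i8_out = np.clip(np.round(rescaled), -127, 127).astype(np.int8)

print(np.abs(i8_out / output_scale - ref).max())           # small error vs. the FP32 result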
Results - Performance
Throughput in images per second (img/s), INT8 vs. FP32; each of the five column groups gives INT8 img/s, FP32 img/s, and the INT8/FP32 speed-up ratio.
Network       INT8   FP32   Ratio   INT8   FP32   Ratio   INT8   FP32   Ratio   INT8    FP32   Ratio   INT8    FP32   Ratio
Resnet-50 562 415 1.354 1045 572 1.825 2572 938 2.741 3567 1126 3.166 3787 1156 3.276
Resnet-152 195 157 1.242 371 205 1.807 1017 357 2.850 1335 415 3.220 1437 436 3.299
VGG-16 393 261 1.508 606 257 2.361 984 382 2.577 1131 416 2.722 1178 426 2.764
VGG-19 345 221 1.559 523 222 2.358 812 311 2.608 916 336 2.729 946 339 2.789
Googlenet 945 913 1.035 1756 1163 1.510 4356 1737 2.508 6545 2300 2.846 7282 2499 2.914
Alexnet 972 823 1.181 1913 1534 1.247 6434 3638 1.768 13899 4758 2.921 18714 5882 3.181
A second set of measurements with the same column layout (INT8 img/s, FP32 img/s, Ratio, for five configurations):
Network       INT8   FP32   Ratio   INT8   FP32   Ratio   INT8   FP32   Ratio   INT8   FP32   Ratio   INT8   FP32   Ratio
Resnet-50 295 148 1.996 462 181 2.552 627 204 3.075 811 233 3.477 902 249 3.621
Resnet-152 110 59 1.871 179 68 2.617 239 79 3.039 318 91 3.496 356 97 3.674
VGG-16 130 47 2.757 189 62 3.029 229 71 3.220 286 84 3.411 DNR DNR
VGG-19 114 41 2.797 162 52 3.117 191 58 3.296 233 67 3.464 DNR DNR
Googlenet 497 306 1.622 777 364 2.131 1131 408 2.769 1576 497 3.170 1784 529 3.375
Alexnet 436 202 2.164 828 348 2.381 1458 570 2.561 3106 844 3.682 4853 1382 3.510