
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

Guangxuan Xiao*1, Ji Lin*1, Mickael Seznec2, Hao Wu2, Julien Demouth2, Song Han1

*Equal contribution. 1Massachusetts Institute of Technology, 2NVIDIA. Correspondence to: Guangxuan Xiao <[email protected]>, Ji Lin <[email protected]>.

Abstract

Large language models (LLMs) show excellent performance but are compute- and memory-intensive. Quantization can reduce memory and accelerate inference. However, for LLMs beyond 100 billion parameters, existing methods cannot maintain accuracy or do not run efficiently on hardware. We propose SmoothQuant, a training-free, accuracy-preserving, and general-purpose post-training quantization (PTQ) solution to enable 8-bit weight, 8-bit activation (W8A8) quantization for LLMs. Based on the fact that weights are easy to quantize while activations are not, SmoothQuant smooths the activation outliers by offline migrating the quantization difficulty from activations to weights with a mathematically equivalent transformation. SmoothQuant enables INT8 quantization of both weights and activations for all the matrix multiplications in LLMs, including OPT-175B, BLOOM-176B, GLM-130B, and MT-NLG 530B. SmoothQuant has better hardware efficiency than existing techniques. We demonstrate up to 1.56× speedup and 2× memory reduction for LLMs with negligible loss in accuracy. We integrate SmoothQuant into FasterTransformer, a state-of-the-art LLM serving framework, and achieve faster inference speed with half the number of GPUs compared to FP16, enabling the serving of a 530B LLM within a single node. Our work offers a turn-key solution that reduces hardware costs and democratizes LLMs. Code is available at https://github.com/mit-han-lab/smoothquant.

Table 1: SmoothQuant achieves high hardware efficiency while maintaining the accuracy of LLMs with 530 billion parameters in a training-free fashion.

Method              | LLM (100B+) Accuracy | Hardware Efficiency
ZeroQuant           | ✗                    | ✓
Outlier Suppression | ✗                    | ✓
LLM.int8()          | ✓                    | ✗
SmoothQuant         | ✓                    | ✓

1 Introduction

Large-scale language models (LLMs) show excellent performance on various tasks (Brown et al., 2020a; Zhang et al., 2022). However, serving LLMs is budget- and energy-consuming due to their gigantic model size. For example, the GPT-3 (Brown et al., 2020a) model contains 175B parameters, which consume at least 350GB of memory to store and run in FP16, requiring 8×48GB A6000 GPUs or 5×80GB A100 GPUs just for inference. Due to the huge computation and communication overhead, the inference latency may also be unacceptable for real-world applications. Quantization is a promising way to reduce the cost of LLMs (Dettmers et al., 2022; Yao et al., 2022). By quantizing the weights and activations with low-bit integers, we can reduce GPU memory requirements, in size and bandwidth, and accelerate compute-intensive operations (i.e., GEMM in linear layers, BMM in attention). For instance, INT8 quantization of weights and activations can halve the GPU memory usage and nearly double the throughput of matrix multiplications compared to FP16.

However, unlike CNN models or smaller transformer models like BERT (Devlin et al., 2019), the activations of LLMs are difficult to quantize. When we scale up LLMs beyond 6.7B parameters, systematic outliers with large magnitude emerge in activations (Dettmers et al., 2022), leading to large quantization errors and accuracy degradation. ZeroQuant (Yao et al., 2022) applies dynamic per-token activation quantization and group-wise weight quantization (defined in Figure 2 and Sec. 2). It can be implemented efficiently and delivers good accuracy for GPT-3-350M and GPT-J-6B. However, it cannot maintain the accuracy of the large OPT model with 175 billion parameters (see Section 5.2). LLM.int8() (Dettmers et al., 2022) addresses that accuracy issue by introducing a mixed-precision decomposition (i.e., it keeps outliers in FP16 and uses INT8 for the other activations).

Figure 1: SmoothQuant's intuition: the activation X is hard to quantize because outliers stretch the quantization range, leaving few effective bits for most values. We migrate the scale variance from activations to the weights W offline to reduce the quantization difficulty of activations. The smoothed activation X̂ and the adjusted weight Ŵ are both easy to quantize. (Panels: (a) the original |X| is hard to quantize while |W| is very easy to quantize; (b) after SmoothQuant, the smoothed |X̂| and the adjusted |Ŵ| are both easy to quantize.)

Figure 2: Definition of per-tensor, per-token, and per-channel quantization. Per-tensor quantization is the most efficient to implement. For vector-wise quantization to efficiently utilize the INT8 GEMM kernels, we can only use scaling factors from the outer dimensions (i.e., the token dimension T and the output-channel dimension Co) but not the inner dimension (i.e., the input-channel dimension Ci). (Panels: (a) per-tensor quantization with a single ∆ per matrix; (b) per-token + per-channel quantization with ∆X of shape [T×1] and ∆W of shape [1×Co].)

However, it is hard to implement the decomposition efficiently on hardware accelerators. Therefore, deriving an efficient, hardware-friendly, and preferably training-free quantization scheme for LLMs that uses INT8 for all the compute-intensive operations remains an open challenge.

We propose SmoothQuant, an accurate and efficient post-training quantization (PTQ) solution for LLMs. SmoothQuant relies on a key observation: even though activations are much harder to quantize than weights due to the presence of outliers (Dettmers et al., 2022), different tokens exhibit similar variations across their channels. Based on this observation, SmoothQuant offline migrates the quantization difficulty from activations to weights (Figure 1). SmoothQuant proposes a mathematically equivalent per-channel scaling transformation that significantly smooths the magnitude across the channels, making the model quantization-friendly. Since SmoothQuant is compatible with various quantization schemes, we implement three efficiency levels of quantization settings for SmoothQuant (see Table 3, O1-O3). Experiments show that SmoothQuant is hardware-efficient: it can maintain the performance of OPT-175B (Zhang et al., 2022), BLOOM-176B (Scao et al., 2022), GLM-130B (Zeng et al., 2022), and MT-NLG 530B (Smith et al., 2022), leading to up to 1.51× speedup and 1.96× memory saving on PyTorch. SmoothQuant is easy to implement. We integrate SmoothQuant into FasterTransformer, the state-of-the-art transformer serving framework, achieving up to 1.56× speedup and halving the memory usage compared with FP16. Remarkably, SmoothQuant allows serving large models like OPT-175B using only half the number of GPUs compared to FP16 while being faster, and enables the serving of a 530B model within one 8-GPU node. Our work democratizes the use of LLMs by offering a turnkey solution to reduce the serving cost. We hope SmoothQuant can inspire greater use of LLMs in the future.

2 Preliminaries

Quantization maps a high-precision value into discrete levels. We study integer uniform quantization (Jacob et al., 2018) (specifically INT8) for better hardware support and efficiency. The quantization process can be expressed as:

    X̄^INT8 = round(X^FP16 / ∆),   ∆ = max(|X|) / (2^(N−1) − 1),   (1)

where X is the floating-point tensor, X̄ is the quantized counterpart, ∆ is the quantization step size, round(·) is the rounding function, and N is the number of bits (8 in our case). Here we assume the tensor is symmetric at 0 for simplicity; the discussion is similar for asymmetric cases (e.g., after ReLU), which add a zero-point (Jacob et al., 2018).

Such a quantizer uses the maximum absolute value to calculate ∆ so that it preserves the outliers in activations, which are found to be important for accuracy (Dettmers et al., 2022). We can calculate ∆ offline with the activations of some calibration samples, which we call static quantization. We can also use the runtime statistics of activations to get ∆, which we call dynamic quantization.
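To make Equation (1) concrete, here is a minimal PyTorch sketch of symmetric per-tensor INT8 quantization (our own illustration, not code from the SmoothQuant release; the function names and the toy tensor are assumptions). Passing a pre-calibrated step size corresponds to static quantization, while computing it from the runtime tensor, as in the example call, corresponds to dynamic quantization.

```python
import torch

def quantize_per_tensor_int8(x, delta=None):
    """Symmetric uniform INT8 quantization (Equation 1).

    If `delta` is None, the step size is computed from the tensor itself
    (dynamic quantization); passing a pre-calibrated `delta` corresponds
    to static quantization.
    """
    n_bits = 8
    if delta is None:
        delta = x.abs().max().item() / (2 ** (n_bits - 1) - 1)  # max(|X|) / 127
    x_int8 = torch.clamp(torch.round(x / delta), -128, 127).to(torch.int8)
    return x_int8, delta

def dequantize(x_int8, delta):
    return x_int8.to(torch.float32) * delta

# Toy example: a single outlier stretches the quantization range for all values.
x = torch.randn(4, 8)
x[0, 0] = 60.0                       # outlier entry
x_q, delta = quantize_per_tensor_int8(x)
print(delta, (x - dequantize(x_q, delta)).abs().max())
```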

As shown in Figure 2, quantization has different granularity levels. Per-tensor quantization uses a single step size for the entire matrix. We can further enable finer-grained quantization by using different quantization step sizes for the activations associated with each token (per-token quantization) or for each output channel of the weights (per-channel quantization). A coarse-grained version of per-channel quantization uses different quantization steps for different channel groups, called group-wise quantization (Shen et al., 2020; Yao et al., 2022).

For a linear layer in Transformers (Vaswani et al., 2017), Y = X·W with Y ∈ R^(T×Co), X ∈ R^(T×Ci), W ∈ R^(Ci×Co), where T is the number of tokens, Ci is the input-channel dimension, and Co is the output-channel dimension (see Figure 2; we omit the batch dimension for simplicity), we can reduce the storage by half compared to FP16 by quantizing the weights to INT8. However, to speed up the inference, we need to quantize both weights and activations into INT8 (i.e., W8A8) to utilize the integer kernels (e.g., INT8 GEMM), which are supported by a wide range of hardware (e.g., NVIDIA GPUs, Intel CPUs, Qualcomm DSPs, etc.).

3 Review of Quantization Difficulty

LLMs are notoriously difficult to quantize due to the outliers in the activations (Dettmers et al., 2022; Wei et al., 2022; Bondarenko et al., 2021). We first review the difficulties of activation quantization and look for a pattern amongst the outliers. We visualize the input activations and the weights of a linear layer that has a large quantization error in Figure 3 (left). We can find several patterns that motivate our method:

1. Activations are harder to quantize than weights. The weight distribution is quite uniform and flat, which is easy to quantize. Previous work has shown that quantizing the weights of LLMs with INT8 or even with INT4 does not degrade accuracy (Dettmers et al., 2022; Yao et al., 2022; Zeng et al., 2022), which echoes our observation.

2. Outliers make activation quantization difficult. The scale of outliers in activations is ∼100× larger than most of the activation values. In the case of per-tensor quantization (Equation 1), the large outliers dominate the maximum magnitude measurement, leading to low effective quantization bits/levels (Figure 1) for non-outlier channels: suppose the maximum magnitude of channel i is m_i and the maximum value of the whole matrix is m, the effective number of quantization levels of channel i is 2^8 · m_i / m. For non-outlier channels, the effective quantization levels would be very small (2-3), leading to large quantization errors.

3. Outliers persist in fixed channels. Outliers appear in a small fraction of the channels. If one channel has an outlier, it persistently appears in all tokens (Figure 3, red). The variance amongst the channels for a given token is large (the activations in some channels are very large, but most are small), but the variance of the magnitudes of a given channel across tokens is small (outlier channels are consistently large).

Due to the persistence of the outliers and the small variance inside each channel, if we could perform per-channel quantization (Bondarenko et al., 2021) of the activation (i.e., use a different quantization step for each channel), the quantization error would be much smaller compared to per-tensor quantization, while per-token quantization helps little. In Table 2, we verify the assumption that simulated per-channel activation quantization successfully bridges the accuracy gap with the FP16 baseline, which echoes the findings of Bondarenko et al.

Table 2: Among different activation quantization schemes, only per-channel quantization (Bondarenko et al., 2021) preserves the accuracy, but it is not compatible with INT8 GEMM kernels. We report the average accuracy on WinoGrande, HellaSwag, PIQA, and LAMBADA.

Model size (OPT-)   6.7B    13B     30B     66B     175B
FP16                64.9%   65.6%   67.9%   69.5%   71.6%
INT8 per-tensor     39.9%   33.0%   32.8%   33.1%   32.3%
INT8 per-token      42.5%   33.0%   33.1%   32.9%   31.7%
INT8 per-channel    64.8%   65.6%   68.0%   69.4%   71.4%

However, per-channel activation quantization does not map well to hardware-accelerated GEMM kernels, which rely on a sequence of operations executed at high throughput (e.g., Tensor Core MMAs) and do not tolerate the insertion of instructions with a lower throughput (e.g., conversions or CUDA Core FMAs) in that sequence. In those kernels, scaling can only be performed along the outer dimensions of the matrix multiplication (i.e., the token dimension T of the activations and the output-channel dimension Co of the weights, see Figure 2), which can be applied after the matrix multiplication finishes:

    Y = diag(∆_X^FP16) · (X̄^INT8 · W̄^INT8) · diag(∆_W^FP16)   (2)

Therefore, previous works all use per-token activation quantization for linear layers (Dettmers et al., 2022; Yao et al., 2022), although it cannot address the difficulty of activation quantization (it is only slightly better than per-tensor quantization).
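To make Equation (2) concrete, the following is a small, self-contained PyTorch sketch (our own illustration, not the CUTLASS kernels used in the paper; all names are assumptions) of a W8A8 matmul where per-token activation scales and per-output-channel weight scales are applied as a diagonal rescaling after the integer product. A real kernel accumulates in INT32; the sketch accumulates in float64 only to keep the toy example exact.

```python
import torch

def w8a8_matmul(x_fp: torch.Tensor, w_fp: torch.Tensor) -> torch.Tensor:
    """Simulate Equation (2): Y = diag(dx) @ (X_int8 @ W_int8) @ diag(dw).

    x_fp: (T, Ci) activations, w_fp: (Ci, Co) weights.
    dx: per-token scales (T, 1), dw: per-output-channel scales (1, Co).
    """
    dx = x_fp.abs().amax(dim=1, keepdim=True) / 127.0         # (T, 1)
    dw = w_fp.abs().amax(dim=0, keepdim=True) / 127.0         # (1, Co)
    x_q = torch.clamp(torch.round(x_fp / dx), -128, 127)      # INT8 values
    w_q = torch.clamp(torch.round(w_fp / dw), -128, 127)
    # Integer matmul, simulated with float64 accumulation for exactness.
    acc = x_q.double() @ w_q.double()
    # Outer-dimension rescaling is a cheap elementwise epilogue.
    return (dx.double() * acc * dw.double()).to(x_fp.dtype)

x = torch.randn(4, 16)
w = torch.randn(16, 8)
print((w8a8_matmul(x, w) - x @ w).abs().max())  # small quantization error
```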

Figure 3: Magnitude of the input activations and weights of a linear layer in OPT-13B before and after SmoothQuant.
Observations: (1) there are a few channels in the original activation map whose magnitudes are very large (greater than 70);
(2) the variance in one activation channel is small; (3) the original weight distribution is flat and uniform. SmoothQuant
migrates the outlier channels from activation to weight. In the end, the outliers in the activation are greatly smoothed while
the weight is still pretty smooth and flat.

4 SmoothQuant

Instead of per-channel activation quantization (which is infeasible), we propose to "smooth" the input activation by dividing it by a per-channel smoothing factor s ∈ R^Ci. To keep the mathematical equivalence of a linear layer, we scale the weights accordingly in the reversed direction:

    Y = (X diag(s)^(−1)) · (diag(s) W) = X̂ Ŵ   (3)

Considering that the input X is usually produced by previous linear operations (e.g., linear layers, layer norms, etc.), we can easily fuse the smoothing factor into the previous layers' parameters offline, which does not incur kernel call overhead from an extra scaling. For some other cases, when the input comes from a residual add, we can add an extra scaling to the residual branch, similar to Wei et al. (2022).

Migrate the quantization difficulty from activations to weights. We aim to choose a per-channel smoothing factor s such that X̂ = X diag(s)^(−1) is easy to quantize. To reduce the quantization error, we should increase the effective quantization bits for all the channels. The total effective quantization bits would be largest when all the channels have the same maximum magnitude. Therefore, a straightforward choice is s_j = max(|X_j|), j = 1, 2, ..., Ci, where j indexes the input channels. This choice ensures that after the division, all the activation channels have the same maximum value, which is easy to quantize. Note that the range of activations is dynamic; it varies for different input samples. Here, we estimate the scale of the activation channels using calibration samples from the pre-training dataset (Jacob et al., 2018). However, this formula pushes all the quantization difficulty to the weights. We find that, in this case, the quantization errors would be large for the weights (the outlier channels are now migrated to the weights), leading to a large accuracy degradation (see Figure 9). On the other hand, we can also push all the quantization difficulty from the weights to the activations by choosing s_j = 1/max(|W_j|). Similarly, the model performance is bad due to the activation quantization errors. Therefore, we need to split the quantization difficulty between weights and activations so that they are both easy to quantize.

Here we introduce a hyper-parameter, the migration strength α, to control how much difficulty we want to migrate from activations to weights, using the following equation:

    s_j = max(|X_j|)^α / max(|W_j|)^(1−α)   (4)

We find that for most of the models, e.g., all OPT (Zhang et al., 2022) and BLOOM (Scao et al., 2022) models, α = 0.5 is a well-balanced point to evenly split the quantization difficulty, especially when we are using the same quantizer for weights and activations (e.g., per-tensor, static quantization). The formula ensures that the weights and activations of the corresponding channel share a similar maximum value, thus sharing the same quantization difficulty. Figure 4 illustrates the smoothing transformation when we take α = 0.5. For some other models where activation outliers are more significant (e.g., GLM-130B (Zeng et al., 2022) has ∼30% outliers, which are more difficult for activation quantization), we can choose a larger α to migrate more quantization difficulty to the weights (like 0.75).

Figure 4: Main idea of SmoothQuant when α is 0.5. The smoothing factor s is obtained on calibration samples and the entire transformation is performed offline. At runtime, the activations are smooth without scaling.
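The following is a minimal PyTorch sketch of Equations (3) and (4); it assumes the per-channel activation maxima have already been collected from calibration samples, and the function and variable names are ours rather than the released SmoothQuant API. It computes the smoothing factor s for a given migration strength α, folds it into the weight, and checks that the smoothed layer stays mathematically equivalent while the activation outlier shrinks.

```python
import torch

def smooth_linear(act_max: torch.Tensor, weight: torch.Tensor, alpha: float = 0.5):
    """Equation (4): s_j = max|X_j|^alpha / max|W_j|^(1-alpha).

    act_max: (Ci,) per-input-channel max |X_j| from calibration samples.
    weight:  (Ci, Co) floating-point weight of the linear layer.
    Returns the smoothing factor s and the scaled weight diag(s) @ W.
    """
    w_max = weight.abs().amax(dim=1)                 # max|W_j| over output channels
    s = act_max.pow(alpha) / w_max.pow(1.0 - alpha)
    s = s.clamp(min=1e-5)                            # avoid dividing by near-zero channels
    return s, weight * s.unsqueeze(1)                # diag(s) @ W

# Equivalence check of Equation (3): (X diag(s)^-1) @ (diag(s) W) == X @ W.
torch.manual_seed(0)
x = torch.randn(8, 16)
x[:, 3] *= 50.0                                      # an outlier channel
w = torch.randn(16, 32)

s, w_smooth = smooth_linear(x.abs().amax(dim=0), w, alpha=0.5)
x_smooth = x / s                                     # X diag(s)^-1; fused offline in practice
print((x_smooth @ w_smooth - x @ w).abs().max())     # tiny: floating-point round-off only
print(x.abs().max().item(), x_smooth.abs().max().item())  # the outlier magnitude shrinks
```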

Applying SmoothQuant to Transformer blocks. Linear layers take up most of the parameters and computation of LLMs. By default, we perform scale smoothing for the input activations of the self-attention and feed-forward layers and quantize all linear layers with W8A8. We also quantize the BMM operators in the attention computation. We design a quantization flow for transformer blocks in Figure 5: we quantize the inputs and weights of compute-heavy operators like linear layers and BMM in attention layers with INT8, while keeping the activations in FP16 for other lightweight element-wise operations like ReLU, Softmax, and LayerNorm. Such a design helps us balance accuracy and inference efficiency.

Figure 5: SmoothQuant's precision mapping for a Transformer block. All compute-intensive operators like linear layers and batched matmuls (BMMs) use INT8 arithmetic.

Table 3: Quantization settings of the baselines and SmoothQuant. All weights and activations use INT8 representations unless specified. For SmoothQuant, the efficiency improves from O1 to O3 (i.e., lower latency).

Method              | Weight      | Activation
W8A8                | per-tensor  | per-tensor dynamic
ZeroQuant           | group-wise  | per-token dynamic
LLM.int8()          | per-channel | per-token dynamic + FP16
Outlier Suppression | per-tensor  | per-tensor static
SmoothQuant-O1      | per-tensor  | per-token dynamic
SmoothQuant-O2      | per-tensor  | per-tensor dynamic
SmoothQuant-O3      | per-tensor  | per-tensor static

5 Experiments

5.1 Setups

Baselines. We compare with four baselines in the INT8 post-training quantization setting, i.e., without re-training of the model parameters: W8A8 naive quantization, ZeroQuant (Yao et al., 2022), LLM.int8() (Dettmers et al., 2022), and Outlier Suppression (Wei et al., 2022). Since SmoothQuant is orthogonal to the quantization schemes, we provide gradually more aggressive and efficient quantization levels from O1 to O3. The detailed quantization schemes of the baselines and SmoothQuant are shown in Table 3.

Models and datasets. We choose three families of LLMs to evaluate SmoothQuant: OPT (Zhang et al., 2022), BLOOM (Scao et al., 2022), and GLM-130B (Zeng et al., 2022). We use seven zero-shot evaluation tasks, LAMBADA (Paperno et al., 2016), HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), WinoGrande (Sakaguchi et al., 2019), OpenBookQA (Mihaylov et al., 2018), RTE (Wang et al., 2018), and COPA (Roemmele et al., 2011), and one language modeling dataset, WikiText (Merity et al., 2016), to evaluate the OPT and BLOOM models. We use MMLU (Hendrycks et al., 2020), MNLI (Williams et al., 2018), QNLI (Wang et al., 2018), and LAMBADA to evaluate the GLM-130B model because some of the aforementioned benchmarks appear in the training set of GLM-130B. We use lm-eval-harness* to evaluate the OPT and BLOOM models, and GLM-130B's official repo† for its own evaluation. Finally, we scale up our method to MT-NLG 530B (Smith et al., 2022), enabling, for the first time, the serving of a >500B model within a single node. Note that we focus on the relative performance change before and after quantization, not the absolute values.

Activation smoothing. The migration strength α = 0.5 is a general sweet spot for all the OPT and BLOOM models, and α = 0.75 for GLM-130B since its activations are more difficult to quantize (Zeng et al., 2022). We get a suitable α by running a quick grid search on a subset of the Pile (Gao et al., 2020) validation set. To get the statistics of activations, we calibrate the smoothing factors and the static quantization step sizes once with 512 random sentences from the pre-training dataset Pile, and apply the same smoothed and quantized model to all downstream tasks. In this way, we can benchmark the generality and zero-shot performance of the quantized LLMs.

Implementation. We implement SmoothQuant with two backends: (1) PyTorch Huggingface‡ for the proof of concept, and (2) FasterTransformer§, as an example of a high-performance framework used in production environments. In both the PyTorch Huggingface and FasterTransformer frameworks, we implement INT8 linear modules and the batched matrix multiplication (BMM) function with CUTLASS INT8 GEMM kernels. We simply replace the original floating-point (FP16) linear modules and the bmm function with our INT8 kernels to obtain the INT8 model.

*https://github.com/EleutherAI/lm-evaluation-harness
†https://github.com/THUDM/GLM-130B
‡https://github.com/huggingface/transformers
§https://github.com/NVIDIA/FasterTransformer
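As a rough sketch of such a module replacement, a quantized linear layer can hold an INT8 weight plus scales and quantize its input on the fly. This is a simulated proof of concept only: the class below and its per-tensor static activation scale are our own assumptions, the matmul is emulated in FP32, and the actual implementations dispatch to CUTLASS INT8 GEMM kernels instead.

```python
import torch
import torch.nn as nn

class SimulatedW8A8Linear(nn.Module):
    """Drop-in replacement for nn.Linear that simulates W8A8 (per-tensor) compute."""

    def __init__(self, linear: nn.Linear, act_scale: float):
        super().__init__()
        w = linear.weight.detach()                   # (Co, Ci), already smoothed offline
        self.w_scale = w.abs().max().item() / 127.0
        w_int8 = torch.clamp(torch.round(w / self.w_scale), -128, 127).to(torch.int8)
        self.register_buffer("w_int8", w_int8)
        self.act_scale = act_scale                   # static scale from calibration (O3 setting)
        self.bias = linear.bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_int8 = torch.clamp(torch.round(x / self.act_scale), -128, 127)
        # A real kernel runs INT8 GEMM with INT32 accumulation; we simulate in FP32.
        y = (x_int8 @ self.w_int8.float().t()) * (self.act_scale * self.w_scale)
        return y if self.bias is None else y + self.bias

# Usage sketch: swap an FP linear whose static activation scale was calibrated.
layer = nn.Linear(16, 32)
q_layer = SimulatedW8A8Linear(layer, act_scale=0.1)
out = q_layer(torch.randn(4, 16))
```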

5.2 Accurate Quantization

Results of OPT-175B. SmoothQuant can handle the quantization of very large LLMs, whose activations are more difficult to quantize. We study quantization on OPT-175B.

Table 4: SmoothQuant maintains the accuracy of the OPT-175B model after INT8 quantization, even with the most aggressive and most efficient O3 setting (Table 3). We extensively benchmark the performance on 7 zero-shot benchmarks (reporting the average accuracy) and 1 language modeling benchmark (perplexity). *For ZeroQuant, we also tried leaving the input activation of self-attention in FP16 and quantizing the rest to INT8, which is their solution for GPT-NeoX-20B, but this does not solve the accuracy degradation on OPT-175B.

OPT-175B            LAMBADA  HellaSwag  PIQA   WinoGrande  OpenBookQA  RTE    COPA   Average↑  WikiText↓
FP16                74.7%    59.3%      79.7%  72.6%       34.0%       59.9%  88.0%  66.9%     10.99
W8A8                0.0%     25.6%      53.4%  50.3%       14.0%       49.5%  56.0%  35.5%     93080
ZeroQuant           0.0%*    26.0%      51.7%  49.3%       17.8%       50.9%  55.0%  35.8%     84648
LLM.int8()          74.7%    59.2%      79.7%  72.1%       34.2%       60.3%  87.0%  66.7%     11.10
Outlier Suppression 0.0%     25.8%      52.5%  48.6%       16.6%       53.4%  55.0%  36.0%     96151
SmoothQuant-O1      74.7%    59.2%      79.7%  71.2%       33.4%       58.1%  89.0%  66.5%     11.11
SmoothQuant-O2      75.0%    59.0%      79.2%  71.2%       33.0%       59.6%  88.0%  66.4%     11.14
SmoothQuant-O3      74.6%    58.9%      79.7%  71.2%       33.4%       59.9%  90.0%  66.8%     11.17

As shown in Table 4, SmoothQuant can match the FP16 accuracy on all evaluation datasets with all quantization schemes. LLM.int8() can match the floating-point accuracy because it uses floating-point values to represent the outliers, which leads to a large latency overhead (Table 8). The W8A8, ZeroQuant, and Outlier Suppression baselines produce nearly random results, indicating that naively quantizing the activations of LLMs destroys the performance.

Results of different LLMs. SmoothQuant can be applied to various LLM designs. In Table 5, we show that SmoothQuant can quantize all existing open LLMs beyond 100B parameters. Compared with the OPT-175B model, the BLOOM-176B model is easier to quantize: none of the baselines completely destroys the model; even the naive W8A8 per-tensor dynamic quantization only degrades the accuracy by 4%. The O1 and O2 levels of SmoothQuant successfully maintain the floating-point accuracy, while the O3 level (per-tensor static) degrades the average accuracy by 0.8%, which we attribute to the discrepancy between the statically collected statistics and the activation statistics of the real evaluation samples. In contrast, the GLM-130B model is more difficult to quantize (which echoes Zeng et al.). Nonetheless, SmoothQuant-O1 can match the FP16 accuracy, while SmoothQuant-O3 only degrades the accuracy by 1%, which significantly outperforms the baselines. Note that we clip the top 2% of tokens when calibrating the static quantization step sizes for GLM-130B, following Wei et al. (2022). Note also that different model/training designs have different quantization difficulties, which we hope will inspire future research.

Table 5: SmoothQuant works for different LLMs. We can quantize the 3 largest, openly available LLMs into INT8 without degrading the accuracy. For OPT-175B and BLOOM-176B, we show the average accuracy on WinoGrande, HellaSwag, PIQA, and LAMBADA. For GLM-130B we show the average accuracy on LAMBADA, MMLU, MNLI, and QNLI. *Accuracy is not column-wise comparable due to the different datasets.

Method              OPT-175B  BLOOM-176B  GLM-130B*
FP16                71.6%     68.2%       73.8%
W8A8                32.3%     64.2%       26.9%
ZeroQuant           31.7%     67.4%       26.7%
LLM.int8()          71.4%     68.0%       73.8%
Outlier Suppression 31.7%     54.1%       63.5%
SmoothQuant-O1      71.2%     68.3%       73.7%
SmoothQuant-O2      71.1%     68.4%       72.5%
SmoothQuant-O3      71.1%     67.4%       72.8%

Results on LLMs of different sizes. SmoothQuant works not only for the very large LLMs beyond 100B parameters; it also works consistently for smaller LLMs. In Figure 6, we show that SmoothQuant works on all scales of OPT models, matching the FP16 accuracy with INT8 quantization.

Figure 6: SmoothQuant-O3 (the most efficient setting, defined in Table 3) preserves the accuracy of OPT models across different scales when quantized to INT8. LLM.int8() requires mixed precision and suffers from a slowdown.

5.3 Speedup and Memory Saving

In this section, we show the measured speedup and memory saving of SmoothQuant-O3 integrated into PyTorch and FasterTransformer. We measure the end-to-end latency of generating all hidden states for a batch of 4 sentences in one pass, i.e., the context-stage latency, and we record the (aggregated) peak GPU memory usage in this process. We only compare SmoothQuant with LLM.int8() because it is the only existing quantization method that preserves LLM accuracy at all scales. Due to the lack of support for model parallelism in Huggingface, we only measure SmoothQuant's performance on a single GPU for the PyTorch implementation, so we choose OPT-6.7B, OPT-13B, and OPT-30B for evaluation. In the FasterTransformer library, SmoothQuant works seamlessly with Tensor Parallelism (Shoeybi et al., 2019), so we test SmoothQuant on OPT-13B, OPT-30B, OPT-66B, and OPT-175B for both single- and multi-GPU benchmarks. All our experiments are conducted on NVIDIA A100 80GB GPU servers.

Results of the PyTorch implementation. In Figure 7, we show the inference latency and peak memory usage based on the PyTorch implementation. SmoothQuant is consistently faster than the FP16 baseline, getting a 1.51× speedup on OPT-30B when the sequence length is 256. We also see a trend that the larger the model, the more significant the acceleration. On the other hand, LLM.int8() is almost always slower than the FP16 baseline, which is due to the large overhead of the mixed-precision activation representation. In terms of memory, SmoothQuant and LLM.int8() can both nearly halve the memory usage of the FP16 model, while SmoothQuant saves slightly more memory because it uses fully INT8 GEMMs.

Results of the FasterTransformer implementation. As shown in Figure 8 (top), compared to FasterTransformer's FP16 implementation of OPT, SmoothQuant-O3 can further reduce the execution latency of OPT-13B and OPT-30B by up to 1.56× when using a single GPU. This is challenging since FasterTransformer is already more than 3× faster than the PyTorch implementation for OPT-30B. Remarkably, for bigger models that have to be distributed across multiple GPUs, SmoothQuant achieves similar or even better latency using only half the number of GPUs (1 GPU instead of 2 for OPT-66B, 4 GPUs instead of 8 for OPT-175B). This could greatly lower the cost of serving LLMs. The amount of memory needed when using SmoothQuant-O3 in FasterTransformer is reduced by a factor of almost 2×, as shown in Figure 8 (bottom).

5.4 Scaling Up: 530B Model Within a Single Node

We can further scale up SmoothQuant beyond 500B-level models, enabling efficient and accurate W8A8 quantization of MT-NLG 530B (Smith et al., 2022). As shown in Tables 6 and 7, SmoothQuant enables W8A8 quantization of the 530B model at a negligible accuracy loss. The reduced model size allows us to serve the model using half the number of GPUs (16 to 8) at a similar latency, enabling the serving of a >500B model within a single node (8×A100 80GB GPUs).

Table 6: SmoothQuant can quantize MT-NLG 530B to W8A8 with negligible accuracy loss.

       LAMBADA  HellaSwag  PIQA   WinoGrande  Average
FP16   76.6%    62.1%      81.0%  72.9%       73.1%
INT8   77.2%    60.4%      80.7%  74.1%       73.1%

Table 7: When serving MT-NLG 530B, SmoothQuant can reduce the memory by half at a similar latency using half the number of GPUs, which allows serving the 530B model within a single node.

SeqLen  Prec.  #GPUs  Latency  Memory
128     FP16   16     232ms    1040GB
        INT8   8      253ms    527GB
256     FP16   16     451ms    1054GB
        INT8   8      434ms    533GB
512     FP16   16     838ms    1068GB
        INT8   8      839ms    545GB
1024    FP16   16     1707ms   1095GB
        INT8   8      1689ms   570GB

5.5 Ablation Study

Quantization schemes. Table 8 shows the inference latency of different quantization schemes based on our PyTorch implementation. We can see that the coarser the quantization granularity (from O1 to O3), the lower the latency. Moreover, static quantization can significantly accelerate inference compared with dynamic quantization because we no longer need to calculate the quantization step sizes at runtime. SmoothQuant is faster than the FP16 baseline under all settings, while LLM.int8() is usually slower. We recommend using a coarser scheme if the accuracy permits.

Migration strength. We need to find a suitable migration strength α (see Equation 4) to balance the quantization difficulty of weights and activations. We ablate the effect of different α's on OPT-175B with LAMBADA in Figure 9. When α is too small (<0.4), the activations are hard to quantize; when α is too large (>0.6), the weights are hard to quantize. Only when we choose α from the sweet-spot region (0.4-0.6) can we get small quantization errors for both weights and activations and maintain the model performance after quantization.

Figure 7: The PyTorch implementation of SmoothQuant-O3 achieves up to 1.51× speedup and 1.96× memory saving for
OPT models on a single NVIDIA A100-80GB GPU, while LLM.int8() slows down the inference in most cases.

Figure 8: Inference latency (top) and memory usage (bottom) of the FasterTransformer implementation on NVIDIA A100-80GB GPUs. For smaller models, the latency can be significantly reduced with SmoothQuant-O3, by up to 1.56× compared to FP16. For the bigger models (OPT-66B and 175B), we can achieve similar or even faster inference using only half the number of GPUs. The memory footprint is almost halved compared to FP16.

6 Related Work

Large language models (LLMs). Pre-trained language models have achieved remarkable performance on various benchmarks by scaling up. GPT-3 (Brown et al., 2020b) is the first LLM beyond 100B parameters and achieves impressive few-shot/zero-shot learning results. Later works (Rae et al., 2021; Smith et al., 2022; Du et al., 2022; Chowdhery et al., 2022) continue to push the frontier of scaling, going beyond 500B parameters. However, as the language model gets larger, serving such models for inference becomes expensive and challenging. In this work, we show that our proposed method can quantize the three largest, openly available LLMs, OPT-175B (Zhang et al., 2022), BLOOM-176B (Scao et al., 2022), and GLM-130B (Zeng et al., 2022), and even MT-NLG 530B (Smith et al., 2022), to reduce the memory cost and accelerate inference.

Model quantization. Quantization is an effective method for reducing the model size and accelerating inference. It has proven effective for various convolutional neural networks (CNNs) (Han et al., 2016; Jacob et al., 2018; Nagel et al., 2019; Wang et al., 2019; Lin et al., 2020) and transformers (Shen et al., 2020; Kim et al., 2021; Liu et al., 2021; Wang et al., 2020; Bondarenko et al., 2021). Weight equalization (Nagel et al., 2019) and channel splitting (Zhao et al., 2019) reduce quantization error by suppressing the outliers in weights. However, these techniques cannot address the activation outliers, which are the major quantization bottleneck for LLMs (Dettmers et al., 2022).

Quantization of LLMs. GPTQ (Frantar et al., 2022) applies quantization only to weights but not activations (please find a short discussion in Appendix A).

ZeroQuant (Yao et al., 2022) and nuQmm (Park et al., 2022) use a per-token and group-wise quantization scheme for LLMs, which requires customized CUDA kernels. Their largest evaluated models are 20B and 2.7B, respectively, and they fail to maintain the performance of LLMs like OPT-175B. LLM.int8() (Dettmers et al., 2022) uses a mixed INT8/FP16 decomposition to address the activation outliers. However, such an implementation leads to a large latency overhead, which can be even slower than FP16 inference. Outlier Suppression (Wei et al., 2022) uses non-scaling LayerNorm and token-wise clipping to deal with the activation outliers. However, it only succeeds on small language models such as BERT (Devlin et al., 2019) and BART (Lewis et al., 2019) and fails to maintain the accuracy for LLMs (Table 5). Our algorithm preserves the performance of LLMs (up to 176B, the largest open-source LLM we can find) with an efficient per-tensor, static quantization scheme without retraining, allowing us to use off-the-shelf INT8 GEMM to achieve high hardware efficiency.

Table 8: GPU latency (ms) of different quantization schemes. The coarser the quantization scheme (from per-token to per-tensor, dynamic to static, O1 to O3, defined in Table 3), the lower the latency. SmoothQuant achieves lower latency compared to FP16 under all settings, while LLM.int8() is mostly slower. The batch size is 4.

Model                OPT-13B          OPT-30B
Sequence Length      256      512     256      512
FP16                 152.6    296.3   343.0    659.9
LLM.int8()           237.1    371.5   387.9    654.9
SmoothQuant-O1       124.5    243.3   246.7    490.7
SmoothQuant-O2       120.5    235.1   240.2    478.3
SmoothQuant-O3       112.1    223.1   227.6    458.4

Figure 9: A suitable migration strength α (sweet spot) makes both activations and weights easy to quantize. If α is too large, the weights will be hard to quantize; if it is too small, the activations will be hard to quantize.

7 Conclusion

We propose SmoothQuant, an accurate and efficient post-training quantization method to enable lossless 8-bit weight and activation quantization for LLMs up to 530B parameters. SmoothQuant enables the quantization of both weights and activations for all GEMMs in LLMs, which significantly reduces the inference latency and memory usage compared with the mixed-precision activation quantization baseline. We integrate SmoothQuant into PyTorch and FasterTransformer, getting up to 1.56× inference acceleration and halving the memory footprint. SmoothQuant democratizes the application of LLMs by offering a turnkey solution to reduce the serving cost.

Acknowledgements

We thank MIT-IBM Watson AI Lab, MIT AI Hardware Program, Amazon and MIT Science Hub, NVIDIA Academic Partnership Award, Qualcomm Innovation Fellowship, Microsoft Turing Academic Program, and NSF for supporting this research. We thank Haotian Tang, Aohan Zeng, Eric Lin, and Jilei Hou for the helpful discussions.

References

Bisk, Y., Zellers, R., Bras, R. L., Gao, J., and Choi, Y. PIQA: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.

Bondarenko, Y., Nagel, M., and Blankevoort, T. Understanding and overcoming the challenges of efficient transformer quantization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 7947–7969, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. URL https://aclanthology.org/2021.emnlp-main.627.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 1877–1901. Curran Associates, Inc., 2020a. URL https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020b.

Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.

Dettmers, T., Lewis, M., Belkada, Y., and Zettlemoyer, L. LLM.int8(): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339, 2022.

Devlin, J., Chang, M., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT 2019, pp. 4171–4186. Association for Computational Linguistics, 2019.

Du, N., Huang, Y., Dai, A. M., Tong, S., Lepikhin, D., Xu, Y., Krikun, M., Zhou, Y., Yu, A. W., Firat, O., et al. GLaM: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning, pp. 5547–5569. PMLR, 2022.

Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022.

Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.

Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In ICLR, 2016.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. CoRR, abs/2009.03300, 2020. URL https://arxiv.org/abs/2009.03300.

Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H., and Kalenichenko, D. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704–2713, 2018.

Kim, S., Gholami, A., Yao, Z., Mahoney, M. W., and Keutzer, K. I-BERT: Integer-only BERT quantization. In International Conference on Machine Learning, pp. 5506–5518. PMLR, 2021.

Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., and Zettlemoyer, L. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019.

Lin, J., Chen, W.-M., Lin, Y., Gan, C., Han, S., et al. MCUNet: Tiny deep learning on IoT devices. Advances in Neural Information Processing Systems, 33:11711–11722, 2020.

Liu, Z., Wang, Y., Han, K., Zhang, W., Ma, S., and Gao, W. Post-training quantization for vision transformer. Advances in Neural Information Processing Systems, 34:28092–28103, 2021.

Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models, 2016.

Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. Can a suit of armor conduct electricity? A new dataset for open book question answering. In EMNLP, 2018.

Nagel, M., Baalen, M. v., Blankevoort, T., and Welling, M. Data-free quantization through weight equalization and bias correction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1325–1334, 2019.

Paperno, D., Kruszewski, G., Lazaridou, A., Pham, N. Q., Bernardi, R., Pezzelle, S., Baroni, M., Boleda, G., and Fernández, R. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1525–1534, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-1144. URL https://aclanthology.org/P16-1144.

Park, G., Park, B., Kwon, S. J., Kim, B., Lee, Y., and Lee, D. nuQmm: Quantized matmul for efficient inference of large-scale generative language models. arXiv preprint arXiv:2206.09557, 2022.

Pope, R., Douglas, S., Chowdhery, A., Devlin, J., Bradbury, J., Levskaya, A., Heek, J., Xiao, K., Agrawal, S., and Dean, J. Efficiently scaling transformer inference. arXiv preprint arXiv:2211.05102, 2022.

Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., Aslanides, J., Henderson, S., Ring, R., Young, S., et al. Scaling language models: Methods, analysis & insights from training Gopher. arXiv preprint arXiv:2112.11446, 2021.

Roemmele, M., Bejan, C. A., and Gordon, A. S. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In Logical Formalizations of Commonsense Reasoning, Papers from the 2011 AAAI Spring Symposium, Technical Report SS-11-06, Stanford, California, USA, March 21–23, 2011. AAAI, 2011. URL http://www.aaai.org/ocs/index.php/SSS/SSS11/paper/view/2418.

Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y. WinoGrande: An adversarial Winograd schema challenge at scale. arXiv preprint arXiv:1907.10641, 2019.

Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A. S., Yvon, F., Gallé, M., et al. BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.

Shen, S., Dong, Z., Ye, J., Ma, L., Yao, Z., Gholami, A., Mahoney, M. W., and Keutzer, K. Q-BERT: Hessian based ultra low precision quantization of BERT. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 8815–8821, 2020.

Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., and Catanzaro, B. Megatron-LM: Training multi-billion parameter language models using model parallelism. CoRR, abs/1909.08053, 2019. URL http://arxiv.org/abs/1909.08053.

Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al. Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model. arXiv preprint arXiv:2201.11990, 2022.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. CoRR, abs/1804.07461, 2018. URL http://arxiv.org/abs/1804.07461.

Wang, H., Zhang, Z., and Han, S. SpAtten: Efficient sparse attention architecture with cascade token and head pruning. CoRR, abs/2012.09852, 2020. URL https://arxiv.org/abs/2012.09852.

Wang, K., Liu, Z., Lin, Y., Lin, J., and Han, S. HAQ: Hardware-aware automated quantization with mixed precision. In CVPR, 2019.

Wei, X., Zhang, Y., Zhang, X., Gong, R., Zhang, S., Zhang, Q., Yu, F., and Liu, X. Outlier suppression: Pushing the limit of low-bit transformer language models, 2022. URL https://arxiv.org/abs/2209.13325.

Williams, A., Nangia, N., and Bowman, S. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1112–1122. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/N18-1101.

Yao, Z., Aminabadi, R. Y., Zhang, M., Wu, X., Li, C., and He, Y. ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers, 2022. URL https://arxiv.org/abs/2206.01861.

Yu, G.-I., Jeong, J. S., Kim, G.-W., Kim, S., and Chun, B.-G. Orca: A distributed serving system for Transformer-based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pp. 521–538, 2022.

Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. HellaSwag: Can a machine really finish your sentence? CoRR, abs/1905.07830, 2019. URL http://arxiv.org/abs/1905.07830.

Zeng, A., Liu, X., Du, Z., Wang, Z., Lai, H., Ding, M., Yang, Z., Xu, Y., Zheng, W., Xia, X., et al. GLM-130B: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022.

Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., Mihaylov, T., Ott, M., Shleifer, S., Shuster, K., Simig, D., Koura, P. S., Sridhar, A., Wang, T., and Zettlemoyer, L. OPT: Open pre-trained transformer language models, 2022. URL https://arxiv.org/abs/2205.01068.

Zhao, R., Hu, Y., Dotzel, J., De Sa, C., and Zhang, Z. Improving neural network quantization without retraining using outlier channel splitting. In International Conference on Machine Learning, pp. 7543–7552. PMLR, 2019.

A Discussion on Weight-Only Quantization

In this work, we study W8A8 quantization so that we can utilize INT8 GEMM kernels to increase the throughput and accelerate inference. There is another line of work that only quantizes the weights of LLMs (e.g., GPTQ (Frantar et al., 2022)). It converts the quantized weights to FP16 on the fly for the matmuls during inference, and it can also lead to a speedup due to the reduced data loading, especially for the generation stage with batch size 1.

We mainly compare our method with existing work on weight-activation quantization (i.e., W8A8), such as Dettmers et al. (2022), Yao et al. (2022), and Wei et al. (2022), since they are under the same setting. Here we give a short discussion of weight-only quantization methods in LLM settings:

1. Firstly, we were trying to compare our method with GPTQ (Frantar et al., 2022) but found it difficult due to the different implementations. GPTQ's low-bit kernel¶ only supports the generation stage with batch size 1 (i.e., only processing a single token at a time) and cannot support the context stage (widely used in different downstream tasks and chatbots) or a batched setting. Furthermore, its low-bit kernel optimization only targets the OPT-175B model (as stated in its README). At the same time, our work utilizes FasterTransformer for serving large models, which may lead to an unfair advantage if we make a direct comparison.

2. GPTQ may perform better at handling a small number of input tokens (1 in its experiments) since the process is highly memory-bound. In contrast, SmoothQuant may serve better in a batched setting or for the context stage (i.e., when the number of processed tokens is larger). Nonetheless, some work shows that in production, we can improve the throughput of serving GPT models by 37× at similar latency with advanced batching (Yu et al., 2022). We believe that in production, batching will be the future standard, and SmoothQuant will bring further improvement, even for the generation stage.

3. Applications like chatbots need to handle a long context length and potentially run under a batched setting. Due to these two factors, the memory size of the KV cache can no longer be ignored (as shown in (Pope et al., 2022), the KV cache totals 3TB given batch size 512 and context length 2048, which is 3× larger than the model weights). In this case, quantization of the activations can also help reduce the memory cost of storing the KV cache (see the sketch after this list for how the KV cache size scales).

4. Finally, we think the two settings are somewhat orthogonal. We believe we can integrate GPTQ's method for better weight quantization and potentially achieve W4A4 quantization, which will lead to even better hardware efficiency (INT4 instructions are supported on NVIDIA's Hopper GPU architecture). We leave this exploration to future work.

¶https://github.com/IST-DASLab/gptq
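To make the scaling argument in item 3 concrete, here is a back-of-the-envelope helper using the standard multi-head-attention KV-cache formula. This is our own illustration with a hypothetical 100B-class configuration; it is not intended to reproduce the exact 3TB figure from Pope et al. (2022), which depends on that model's specific architecture.

```python
def kv_cache_bytes(n_layers: int, hidden_dim: int, batch_size: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Size of the KV cache for standard multi-head attention.

    Each layer stores one key and one value vector of size `hidden_dim`
    per token, so the total is 2 * layers * batch * context * hidden * bytes.
    `bytes_per_elem` is 2 for FP16 and 1 for INT8.
    """
    return 2 * n_layers * batch_size * context_len * hidden_dim * bytes_per_elem

# Hypothetical 100B-class configuration (not a specific published model):
fp16 = kv_cache_bytes(n_layers=96, hidden_dim=12288, batch_size=512, context_len=2048)
int8 = kv_cache_bytes(n_layers=96, hidden_dim=12288, batch_size=512, context_len=2048,
                      bytes_per_elem=1)
print(f"FP16 KV cache: {fp16 / 1e12:.1f} TB, INT8 KV cache: {int8 / 1e12:.1f} TB")
```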
