SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
Guangxuan Xiao * 1 Ji Lin * 1 Mickael Seznec 2 Hao Wu 2 Julien Demouth 2 Song Han 1
Figure 1: SmoothQuant's intuition: the activation X is hard to quantize because outliers stretch the quantization range, leaving few effective bits for most values. We migrate the scale variance from the activations to the weights W offline to reduce the quantization difficulty of the activations. The smoothed activation X̂ and the adjusted weight Ŵ are both easy to quantize.

Figure 2: Definition of per-tensor, per-token, and per-channel quantization. Per-tensor quantization is the most efficient to implement. For vector-wise quantization to efficiently utilize the INT8 GEMM kernels, we can only use scaling factors from the outer dimensions (i.e., the token dimension T and the output channel dimension Co) but not the inner dimension (i.e., the input channel dimension Ci).
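To make the three granularities in Figure 2 concrete, here is a minimal PyTorch sketch (our own illustration, not the paper's code) of how symmetric INT8 scaling factors would be computed per tensor, per token, and per channel for an activation matrix of shape (T, Ci):

    import torch

    def int8_scales(x: torch.Tensor):
        # x: activation matrix of shape (T, Ci) = (tokens, input channels)
        q_max = 127.0
        per_tensor  = x.abs().max() / q_max                      # one scalar for the whole matrix
        per_token   = x.abs().amax(dim=1, keepdim=True) / q_max  # shape (T, 1): one scale per token
        per_channel = x.abs().amax(dim=0, keepdim=True) / q_max  # shape (1, Ci): one scale per channel
        return per_tensor, per_token, per_channel

    x = torch.randn(4, 8)
    x[:, 3] *= 100                      # emulate a persistent outlier channel
    s_t, s_tok, s_ch = int8_scales(x)
    x_q = torch.clamp(torch.round(x / s_ch), -127, 127)  # simulated (fake) per-channel quantization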
Per-tensor quantization uses a single step size for the entire matrix. We can further enable finer-grained quantization by using different quantization step sizes for the activations associated with each token (per-token quantization) or for each output channel of the weights (per-channel quantization). A coarse-grained version of per-channel quantization uses different quantization steps for different channel groups, called group-wise quantization (Shen et al., 2020; Yao et al., 2022).

For a linear layer in Transformers (Vaswani et al., 2017),

Y = X · W,  Y ∈ R^(T×Co), X ∈ R^(T×Ci), W ∈ R^(Ci×Co),

where T is the number of tokens, Ci is the input channel dimension, and Co is the output channel dimension (see Figure 2; we omit the batch dimension for simplicity), we can reduce the storage by half compared to FP16 by quantizing the weights to INT8. However, to speed up inference, we need to quantize both weights and activations into INT8 (i.e., W8A8) to utilize the integer kernels (e.g., INT8 GEMM), which are supported by a wide range of hardware (e.g., NVIDIA GPUs, Intel CPUs, Qualcomm DSPs, etc.).

3 Review of Quantization Difficulty

LLMs are notoriously difficult to quantize due to the outliers in the activations (Dettmers et al., 2022; Wei et al., 2022; Bondarenko et al., 2021). We first review the difficulties of activation quantization and look for a pattern amongst the outliers. We visualize the input activations and the weights of a linear layer that has a large quantization error in Figure 3 (left). We find several patterns that motivate our method:

1. Activations are harder to quantize than weights. The weight distribution is quite uniform and flat, which is easy to quantize. Previous work has shown that quantizing the weights of LLMs with INT8 or even INT4 does not degrade accuracy (Dettmers et al., 2022; Yao et al., 2022; Zeng et al., 2022), which echoes our observation.

2. Outliers make activation quantization difficult. The scale of outliers in activations is ~100× larger than that of most activation values. In the case of per-tensor quantization (Equation 1), the large outliers dominate the maximum magnitude measurement, leaving few effective quantization bits/levels (Figure 1) for the non-outlier channels: suppose the maximum magnitude of channel i is m_i and the maximum value of the whole matrix is m; the effective number of quantization levels of channel i is 2^8 · m_i / m. For non-outlier channels, the effective quantization levels would be very small (2-3), leading to large quantization errors.

3. Outliers persist in fixed channels. Outliers appear in a small fraction of the channels. If one channel has an outlier, it persistently appears in all tokens (Figure 3, red). The variance amongst the channels for a given token is large (the activations in some channels are very large, but most are small), but the variance between the magnitudes of a given channel across tokens is small (outlier channels are consistently large). Due to the persistence of outliers and the small variance inside each channel, if we could perform per-channel quantization (Bondarenko et al., 2021) of the activation (i.e., using a different quantization step for each channel), the quantization error would be much smaller compared to per-tensor quantization, while per-token quantization helps little. In Table 2, we verify the assumption that simulated per-channel activation quantization successfully bridges the accuracy gap with the FP16 baseline, which echoes the findings of Bondarenko et al.

Table 2: Among different activation quantization schemes, only per-channel quantization (Bondarenko et al., 2021) preserves the accuracy, but it is not compatible (marked in gray) with INT8 GEMM kernels. We report the average accuracy on WinoGrande, HellaSwag, PIQA, and LAMBADA.

Model size (OPT-)   6.7B    13B     30B     66B     175B
FP16                64.9%   65.6%   67.9%   69.5%   71.6%
INT8 per-tensor     39.9%   33.0%   32.8%   33.1%   32.3%
INT8 per-token      42.5%   33.0%   33.1%   32.9%   31.7%
INT8 per-channel    64.8%   65.6%   68.0%   69.4%   71.4%

However, per-channel activation quantization does not map well to hardware-accelerated GEMM kernels, which rely on a sequence of operations executed at high throughput (e.g., Tensor Core MMAs) and do not tolerate the insertion of instructions with lower throughput (e.g., conversions or CUDA Core FMAs) into that sequence. In those kernels, scaling can only be performed along the outer dimensions of the matrix multiplication (i.e., the token dimension of the activations T and the output channel dimension of the weights Co, see Figure 2), which can be applied after the matrix multiplication finishes:

Y = diag(Δ_X^FP16) · (X̄^INT8 · W̄^INT8) · diag(Δ_W^FP16)   (2)

Therefore, previous works all use per-token activation quantization for linear layers (Dettmers et al., 2022; Yao et al., 2022), although it cannot address the difficulty of activation quantization (it is only slightly better than per-tensor quantization).

4 SmoothQuant

Instead of per-channel activation quantization (which is infeasible), we propose to "smooth" the input activation by dividing it by a per-channel smoothing factor s ∈ R^(Ci). To keep the mathematical equivalence of the linear layer, we scale the weights accordingly in the reverse direction:

Y = (X diag(s)^(-1)) · (diag(s) W) = X̂ Ŵ   (3)

Considering that the input X is usually produced by previous linear operations (e.g., linear layers, layer norms, etc.), we can easily fuse the smoothing factor into the parameters of the previous layers offline, so the smoothing does not incur extra kernel-call overhead at inference time.
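As a concrete illustration of Equation (3), here is a minimal PyTorch sketch (our own example, not the released implementation; for simplicity the smoothing factor is just the per-channel activation maximum, whereas the paper's Equation 4 chooses s more carefully). It shows the outlier scale moving from the activation into the weight while the layer output stays unchanged:

    import torch

    T, Ci, Co = 4, 8, 16
    X = torch.randn(T, Ci)
    X[:, 3] *= 100                      # an outlier channel
    W = torch.randn(Ci, Co) * 0.05

    s = X.abs().amax(dim=0).clamp(min=1e-5)   # per-input-channel smoothing factor, shape (Ci,)

    X_hat = X / s                       # X · diag(s)^-1: the outlier channel is scaled down
    W_hat = s.unsqueeze(1) * W          # diag(s) · W: the difficulty is migrated into the weights

    # Mathematical equivalence of Equation (3): Y = X W = X̂ Ŵ
    assert torch.allclose(X @ W, X_hat @ W_hat, rtol=1e-3, atol=1e-3)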
Figure 3: Magnitude of the input activations and weights of a linear layer in OPT-13B before and after SmoothQuant.
Observations: (1) there are a few channels in the original activation map whose magnitudes are very large (greater than 70);
(2) the variance in one activation channel is small; (3) the original weight distribution is flat and uniform. SmoothQuant
migrates the outlier channels from activation to weight. In the end, the outliers in the activation are greatly smoothed while
the weight is still pretty smooth and flat.
Table 4: SmoothQuant maintains the accuracy of OPT-175B model after INT8 quantization, even with the most aggressive
and most efficient O3 setting (Table 3). We extensively benchmark the performance on 7 zero-shot benchmarks (by reporting
the average accuracy) and 1 language modeling benchmark (perplexity). *For ZeroQuant, we also tried leaving the input
activation of self-attention in FP16 and quantizing the rest to INT8, which is their solution for GPT-NeoX-20B, but this
does not solve the accuracy degradation of OPT-175B.
OPT-175B LAMBADA HellaSwag PIQA WinoGrande OpenBookQA RTE COPA Average↑ WikiText↓
FP16 74.7% 59.3% 79.7% 72.6% 34.0% 59.9% 88.0% 66.9% 10.99
W8A8 0.0% 25.6% 53.4% 50.3% 14.0% 49.5% 56.0% 35.5% 93080
ZeroQuant 0.0%* 26.0% 51.7% 49.3% 17.8% 50.9% 55.0% 35.8% 84648
LLM.int8() 74.7% 59.2% 79.7% 72.1% 34.2% 60.3% 87.0% 66.7% 11.10
Outlier Suppression 0.00% 25.8% 52.5% 48.6% 16.6% 53.4% 55.0% 36.0% 96151
SmoothQuant-O1 74.7% 59.2% 79.7% 71.2% 33.4% 58.1% 89.0% 66.5% 11.11
SmoothQuant-O2 75.0% 59.0% 79.2% 71.2% 33.0% 59.6% 88.0% 66.4% 11.14
SmoothQuant-O3 74.6% 58.9% 79.7% 71.2% 33.4% 59.9% 90.0% 66.8% 11.17
5.3 Speedup and Memory Saving

In this section, we show the measured speedup and memory saving of SmoothQuant-O3 integrated into PyTorch and FasterTransformer. We measure the end-to-end latency of generating all hidden states for a batch of 4 sentences in one pass, i.e., the context-stage latency, and record the (aggregated) peak GPU memory usage in this process. We only compare SmoothQuant with LLM.int8() because it is the only existing quantization method that can preserve LLM accuracy at all scales. Due to the lack of support for model parallelism in Huggingface, we only measure SmoothQuant's performance on a single GPU for the PyTorch implementation, so we choose OPT-6.7B, OPT-13B, and OPT-30B for evaluation. In the FasterTransformer library, SmoothQuant can seamlessly work with the Tensor Parallelism algorithm (Shoeybi et al., 2019), so we test SmoothQuant on OPT-13B, OPT-30B, OPT-66B, and OPT-175B for both single- and multi-GPU benchmarks. All our experiments are conducted on NVIDIA A100 80GB GPU servers.

Results of the PyTorch implementation. In Figure 7, we show the inference latency and peak memory usage based on the PyTorch implementation. SmoothQuant is consistently faster than the FP16 baseline, getting a 1.51× speedup on OPT-30B when the sequence length is 256. We also see a trend that the larger the model, the more significant the acceleration. On the other hand, LLM.int8() is almost always slower than the FP16 baseline, which is due to the large overhead of the mixed-precision activation representation. In terms of memory, SmoothQuant and LLM.int8() can both nearly halve the memory usage of the FP16 model, while SmoothQuant saves slightly more memory because it uses fully INT8 GEMMs.

Results of the FasterTransformer implementation. As shown in Figure 8 (top), compared to FasterTransformer's FP16 implementation of OPT, SmoothQuant-O3 can further reduce the execution latency of OPT-13B and OPT-30B by up to 1.56× when using a single GPU. This is challenging since FasterTransformer is already more than 3× faster than the PyTorch implementation for OPT-30B. Remarkably, for bigger models that have to be distributed across multiple GPUs, SmoothQuant achieves similar or even better latency using only half the number of GPUs (1 GPU instead of 2 for OPT-66B, 4 GPUs instead of 8 for OPT-175B). This could greatly lower the cost of serving LLMs. The amount of memory needed when using SmoothQuant-O3 in FasterTransformer is reduced by a factor of almost 2×, as shown in Figure 8 (bottom).

5.4 Scaling Up: 530B Model Within a Single Node

We can further scale up SmoothQuant beyond 500B-level models, enabling efficient and accurate W8A8 quantization of MT-NLG 530B (Smith et al., 2022). As shown in Tables 6 and 7, SmoothQuant enables W8A8 quantization of the 530B model at a negligible accuracy loss. The reduced model size allows us to serve the model using half the number of GPUs (16 to 8) at a similar latency, enabling the serving of a >500B model within a single node (8×A100 80GB GPUs).

Table 6: SmoothQuant can quantize MT-NLG 530B to W8A8 with negligible accuracy loss.

Prec.  LAMBADA  HellaSwag  PIQA   WinoGrande  Average
FP16   76.6%    62.1%      81.0%  72.9%       73.1%
INT8   77.2%    60.4%      80.7%  74.1%       73.1%

Table 7: When serving MT-NLG 530B, SmoothQuant can reduce the memory by half at a similar latency using half the number of GPUs, which allows serving the 530B model within a single node.

SeqLen  Prec.  #GPUs  Latency  Memory
128     FP16   16     232ms    1040GB
128     INT8   8      253ms    527GB
256     FP16   16     451ms    1054GB
256     INT8   8      434ms    533GB
512     FP16   16     838ms    1068GB
512     INT8   8      839ms    545GB
1024    FP16   16     1707ms   1095GB
1024    INT8   8      1689ms   570GB

5.5 Ablation Study

Quantization schemes. Table 8 shows the inference latency of different quantization schemes based on our PyTorch implementation. We can see that the coarser the quantization granularity (from O1 to O3), the lower the latency. Static quantization can significantly accelerate inference compared with dynamic quantization because we no longer need to calculate the quantization step sizes at runtime. SmoothQuant is faster than the FP16 baseline under all settings, while LLM.int8() is usually slower. We recommend using a coarser scheme if the accuracy permits.

Migration strength. We need to find a suitable migration strength α (see Equation 4) to balance the quantization difficulty of weights and activations. We ablate the effect of different α's on OPT-175B with LAMBADA in Figure 9. When α is too small (<0.4), the activations are hard to quantize; when α is too large (>0.6), the weights will be hard to quantize. Only when we choose α from the sweet-spot region (0.4-0.6) can we get small quantization errors for both weights and activations and maintain the model performance after quantization.
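Equation 4 (not reproduced in this excerpt) defines the smoothing factor as s_j = max(|X_j|)^α / max(|W_j|)^(1−α); under that assumption, the following minimal sketch (our own illustration) shows how sweeping α trades activation outliers against weight outliers:

    import torch

    def smoothing_factor(X: torch.Tensor, W: torch.Tensor, alpha: float = 0.5):
        # s_j = max|X_j|^alpha / max|W_j|^(1-alpha), computed per input channel j
        # (assumed form of Equation 4; alpha is the migration strength).
        act_max    = X.abs().amax(dim=0).clamp(min=1e-5)   # over tokens
        weight_max = W.abs().amax(dim=1).clamp(min=1e-5)   # over output channels
        return act_max.pow(alpha) / weight_max.pow(1.0 - alpha)

    # Sweep alpha as in the ablation: too small leaves the activation outliers
    # in place, too large pushes the outliers into the weights instead.
    X, W = torch.randn(4, 8), torch.randn(8, 16) * 0.05
    X[:, 3] *= 100                                         # outlier channel
    for alpha in (0.3, 0.5, 0.7):
        s = smoothing_factor(X, W, alpha)
        X_hat, W_hat = X / s, s.unsqueeze(1) * W
        print(alpha, X_hat.abs().max().item(), W_hat.abs().max().item())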
Figure 7: The PyTorch implementation of SmoothQuant-O3 achieves up to 1.51× speedup and 1.96× memory saving for
OPT models on a single NVIDIA A100-80GB GPU, while LLM.int8() slows down the inference in most cases.
Figure 8: Inference latency (top) and memory usage (bottom) of the FasterTransformer implementation on NVIDIA
A100-80GB GPUs. For smaller models, the latency can be significantly reduced with SmoothQuant-O3, by up to 1.56× compared to FP16. For the bigger models (OPT-66B and 175B), we can achieve similar or even faster inference using only half the number of GPUs. The memory footprint is almost halved compared to FP16.
Acknowledgements
We thank MIT-IBM Watson AI Lab, MIT AI Hardware Pro-
gram, Amazon and MIT Science Hub, NVIDIA Academic
Partnership Award, Qualcomm Innovation Fellowship, Mi-
crosoft Turing Academic Program, and NSF for supporting
this research. We thank Haotian Tang, Aohan Zeng, Eric
Lin and Jilei Hou for the helpful discussions.
Figure 9: A suitable migration strength α (sweet spot) makes both activations and weights easy to quantize. If α is too large, the weights will be hard to quantize; if too small, the activations will be hard to quantize.

ZeroQuant (Yao et al., 2022) and nuQmm (Park et al., 2022) use a per-token and group-wise quantization scheme for LLMs, which requires customized CUDA kernels. Their largest evaluated models are 20B and 2.7B, respectively, and they fail to maintain the performance of LLMs like OPT-175B. LLM.int8() (Dettmers et al., 2022) uses mixed INT8/FP16 decomposition to address the activation outliers. However, such an implementation leads to large latency overhead, which can be even slower than FP16 inference. Outlier Suppression (Wei et al., 2022) uses non-scaling LayerNorm and token-wise clipping to deal with the activation outliers. However, it only succeeds on small language models such as BERT (Devlin et al., 2019) and BART (Lewis et al., 2019) and fails to maintain the accuracy for LLMs (Table 5). Our algorithm preserves the performance of LLMs (up to 176B, the largest open-source LLM we can find) with an efficient per-tensor, static quantization scheme without retraining, allowing us to use off-the-shelf INT8 GEMM to achieve high hardware efficiency.

References

Bisk, Y., Zellers, R., Bras, R. L., Gao, J., and Choi, Y. Piqa: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.

Bondarenko, Y., Nagel, M., and Blankevoort, T. Understanding and overcoming the challenges of efficient transformer quantization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 7947–7969, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. URL https://aclanthology.org/2021.emnlp-main.627.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 1877–1901. Curran Associates, Inc., 2020a. URL https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020b.

Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.

Dettmers, T., Lewis, M., Belkada, Y., and Zettlemoyer, L. Llm.int8(): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339, 2022.

Devlin, J., Chang, M., Lee, K., and Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT 2019, pp. 4171–4186. Association for Computational Linguistics, 2019.

Du, N., Huang, Y., Dai, A. M., Tong, S., Lepikhin, D., Xu, Y., Krikun, M., Zhou, Y., Yu, A. W., Firat, O., et al. Glam: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning, pp. 5547–5569. PMLR, 2022.

Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022.

Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., et al. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.

Han, S., Mao, H., and Dally, W. J. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. In ICLR, 2016.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. CoRR, abs/2009.03300, 2020. URL https://arxiv.org/abs/2009.03300.

Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H., and Kalenichenko, D. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704–2713, 2018.

Kim, S., Gholami, A., Yao, Z., Mahoney, M. W., and Keutzer, K. I-bert: Integer-only bert quantization. In International Conference on Machine Learning, pp. 5506–5518. PMLR, 2021.

Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., and Zettlemoyer, L. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019.

Lin, J., Chen, W.-M., Lin, Y., Gan, C., Han, S., et al. Mcunet: Tiny deep learning on iot devices. Advances in Neural Information Processing Systems, 33:11711–11722, 2020.

Liu, Z., Wang, Y., Han, K., Zhang, W., Ma, S., and Gao, W. Post-training quantization for vision transformer. Advances in Neural Information Processing Systems, 34:28092–28103, 2021.

Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models, 2016.

Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. Can a suit of armor conduct electricity? a new dataset for open book question answering. In EMNLP, 2018.

Nagel, M., Baalen, M. v., Blankevoort, T., and Welling, M. Data-free quantization through weight equalization and bias correction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1325–1334, 2019.

Paperno, D., Kruszewski, G., Lazaridou, A., Pham, N. Q., Bernardi, R., Pezzelle, S., Baroni, M., Boleda, G., and Fernández, R. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1525–1534, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-1144. URL https://aclanthology.org/P16-1144.

Park, G., Park, B., Kwon, S. J., Kim, B., Lee, Y., and Lee, D. nuqmm: Quantized matmul for efficient inference of large-scale generative language models. arXiv preprint arXiv:2206.09557, 2022.

Pope, R., Douglas, S., Chowdhery, A., Devlin, J., Bradbury, J., Levskaya, A., Heek, J., Xiao, K., Agrawal, S., and Dean, J. Efficiently scaling transformer inference. arXiv preprint arXiv:2211.05102, 2022.

Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., Aslanides, J., Henderson, S., Ring, R., Young, S., et al. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446, 2021.

Roemmele, M., Bejan, C. A., and Gordon, A. S. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In Logical Formalizations of Commonsense Reasoning, Papers from the 2011 AAAI Spring Symposium, Technical Report SS-11-06, Stanford, California, USA, March 21-23, 2011. AAAI, 2011. URL http://www.aaai.org/ocs/index.php/SSS/SSS11/paper/view/2418.

Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y. Winogrande: An adversarial winograd schema challenge at scale. arXiv preprint arXiv:1907.10641, 2019.

Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A. S., Yvon, F., Gallé, M., et al. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.

Shen, S., Dong, Z., Ye, J., Ma, L., Yao, Z., Gholami, A., Mahoney, M. W., and Keutzer, K. Q-bert: Hessian based ultra low precision quantization of bert. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 8815–8821, 2020.

Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., and Catanzaro, B. Megatron-lm: Training multi-billion parameter language models using model parallelism. CoRR, abs/1909.08053, 2019. URL http://arxiv.org/abs/1909.08053.

Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al. Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990, 2022.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. CoRR, abs/1804.07461, 2018. URL http://arxiv.org/abs/1804.07461.

Wang, K., Liu, Z., Lin, Y., Lin, J., and Han, S. HAQ: Hardware-Aware Automated Quantization with Mixed Precision. In CVPR, 2019.

Wei, X., Zhang, Y., Zhang, X., Gong, R., Zhang, S., Zhang, Q., Yu, F., and Liu, X. Outlier suppression: Pushing the limit of low-bit transformer language models, 2022. URL https://arxiv.org/abs/2209.13325.

Williams, A., Nangia, N., and Bowman, S. R. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1112–1122. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/N18-1101.

Yao, Z., Aminabadi, R. Y., Zhang, M., Wu, X., Li, C., and He, Y. Zeroquant: Efficient and affordable post-training quantization for large-scale transformers, 2022. URL https://arxiv.org/abs/2206.01861.

Yu, G.-I., Jeong, J. S., Kim, G.-W., Kim, S., and Chun, B.-G. Orca: A distributed serving system for Transformer-based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pp. 521–538, 2022.

Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. Hellaswag: Can a machine really finish your sentence? CoRR, abs/1905.07830, 2019. URL http://arxiv.org/abs/1905.07830.

Zeng, A., Liu, X., Du, Z., Wang, Z., Lai, H., Ding, M., Yang, Z., Xu, Y., Zheng, W., Xia, X., et al. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022.

Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., Mihaylov, T., Ott, M., Shleifer, S., Shuster, K., Simig, D., Koura, P. S., Sridhar, A., Wang, T., and Zettlemoyer, L. Opt: Open pre-trained transformer language models, 2022. URL https://arxiv.org/abs/2205.01068.

Zhao, R., Hu, Y., Dotzel, J., De Sa, C., and Zhang, Z. Improving neural network quantization without retraining using outlier channel splitting. In International Conference on Machine Learning, pp. 7543–7552. PMLR, 2019.
A Discussion on Weight-Only Quantization

In this work, we study W8A8 quantization so that we can utilize INT8 GEMM kernels to increase the throughput and accelerate inference. There is another line of work that only quantizes the weights of LLMs (e.g., GPTQ (Frantar et al., 2022)). It converts the quantized weights to FP16 on the fly for the matmuls during inference, which can also lead to a speedup due to the reduced data loading, especially for the generation stage with batch size 1.

We mainly compare our method with existing work on weight-activation quantization (i.e., W8A8), like (Dettmers et al., 2022; Yao et al., 2022; Wei et al., 2022), since they are under the same setting. Here we would like to give a short discussion of the weight-only quantization methods in the LLM setting:

4. Finally, we think the two settings are somewhat orthogonal. We believe we can integrate GPTQ's method for better weight quantization and potentially achieve W4A4 quantization, which would lead to even better hardware efficiency (INT4 instructions are supported on NVIDIA's Hopper GPU architecture). We leave this exploration to future work.
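As a rough illustration of the weight-only setting discussed above (a minimal sketch of our own, not GPTQ's actual kernels or bit packing), the weights are kept in low precision and dequantized on the fly right before a floating-point matmul:

    import torch

    def quantize_weight_per_channel(w: torch.Tensor):
        # One scale per output channel; real weight-only methods use lower bit
        # widths and grouped scales, omitted here for clarity.
        scale = w.abs().amax(dim=0, keepdim=True) / 127.0
        w_int8 = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
        return w_int8, scale

    def weight_only_linear(x, w_int8, scale):
        w_deq = w_int8.float() * scale   # on-the-fly dequantization (extra work per call)
        return x @ w_deq                 # the GEMM itself still runs in floating point

    w = torch.randn(512, 512)
    w_q, s = quantize_weight_per_channel(w)
    y = weight_only_linear(torch.randn(1, 512), w_q, s)   # batch-size-1, generation-style call

The benefit comes from loading a smaller weight tensor from memory; the arithmetic itself is not done in integer, which is why this setting is complementary to the W8A8 setting studied in the main paper.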