I-BERT: Integer-only BERT Quantization - arXiv

Sehoon Kim* 1, Amir Gholami* 1, Zhewei Yao* 1, Michael W. Mahoney 1, Kurt Keutzer 1

Abstract

Transformer based models, like BERT and RoBERTa, have achieved state-of-the-art results in many Natural Language Processing tasks. However, their memory footprint, inference latency, and power consumption are prohibitive for efficient inference at the edge, and even at the data center. While quantization can be a viable solution for this, previous work on quantizing Transformer based models uses floating-point arithmetic during inference, which cannot efficiently utilize integer-only logical units such as the recent Turing Tensor Cores, or traditional integer-only ARM processors. In this work, we propose I-BERT, a novel quantization scheme for Transformer based models that quantizes the entire inference with integer-only arithmetic.

Based on lightweight integer-only approximation methods for nonlinear operations, e.g., GELU, Softmax, and Layer Normalization, I-BERT performs an end-to-end integer-only BERT inference without any floating point calculation. We evaluate our approach on GLUE downstream tasks using RoBERTa-Base/Large. We show that for both cases, I-BERT achieves similar (and slightly higher) accuracy as compared to the full-precision baseline. Furthermore, our preliminary implementation of I-BERT shows a speedup for INT8 inference on a T4 GPU system as compared to FP32 inference. The framework has been developed in PyTorch and has been open-sourced (Kim, 2021).

*Equal contribution. 1 University of California, Berkeley. Correspondence to: Sehoon Kim, Kurt Keutzer. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

1. Introduction

The recent Transformer based Neural Network (NN) models (Vaswani et al., 2017), pre-trained from large unlabeled data (e.g., BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019), and the GPT family (Brown et al., 2020; Radford et al., 2018; 2019)), have achieved a significant accuracy improvement when fine-tuned on a wide range of Natural Language Processing (NLP) tasks such as sentence classification (Wang et al., 2018) and question answering (Rajpurkar et al., 2016). Despite the state-of-the-art results in various NLP tasks, pre-trained Transformer models are generally orders of magnitude larger than prior models. For example, the BERT-Large model (Devlin et al., 2018) contains 340M parameters.

Much larger Transformer models have been introduced in the past few years, with even more parameters (Brown et al., 2020; Lepikhin et al., 2020; Radford et al., 2019; Raffel et al., 2019; Rosset, 2019; Shoeybi et al., 2019; Yang et al., 2019). Efficient deployment of these models has become a major challenge, even in data centers, due to limited resources (energy, memory footprint, and compute) and the need for real-time inference. Obviously, these challenges are greater for edge devices, where the compute and energy resources are more constrained.

One promising method to tackle this challenge is quantization (Dong et al., 2019; Jacob et al., 2018; Krishnamoorthi, 2018; Wu et al., 2018; 2016; Zhang et al., 2018), a procedure which compresses NN models into smaller size by representing parameters and/or activations with low bit precision, e.g., 8-bit integer (INT8) instead of 32-bit floating point (FP32).
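To make the idea of low bit precision concrete, the following is a minimal sketch of symmetric uniform quantization to INT8, assuming a single scale per tensor chosen from the largest absolute value. The function names and the example weights are hypothetical and are not taken from the I-BERT code base.

```python
def quantize_symmetric(values, num_bits=8):
    """Map real values to signed integers that share one scale (assumed scheme)."""
    q_max = 2 ** (num_bits - 1) - 1               # 127 for INT8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / q_max if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the symmetric INT8 range.
    q = [max(-q_max, min(q_max, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate real values from integers and their scale."""
    return [scale * x for x in q]


weights = [0.52, -1.27, 0.03, 0.9]                # hypothetical FP32 parameters
q_weights, scale = quantize_symmetric(weights)
recovered = dequantize(q_weights, scale)
```

Each recovered value differs from the original by at most half a quantization step, which is the price paid for storing one byte per parameter instead of four.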

Quantization reduces memory footprint by storing parameters/activations in low precision. With the recent integer-only quantization methods, one can also benefit from faster inference speed by using low precision integer multiplication and accumulation, instead of floating point arithmetic. However, previous quantization schemes for Transformer based models use simulated quantization (aka fake quantization), where all or part of operations in the inference (e.g., GELU (Hendrycks & Gimpel, 2016), Softmax, and Layer Normalization (Ba et al., 2016)) are carried out with floating point arithmetic (Bhandare et al., 2019; Shen et al., 2020; Zafrir et al., 2019). This approach has multiple drawbacks for deployment in real edge application scenarios. Most importantly, the resulting NN models cannot be deployed on neural accelerators or popular edge processors that do not support floating point arithmetic.
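The difference from simulated quantization can be sketched as follows: the dot product is computed on integers with a wide accumulator, and the result is rescaled with an integer multiply and a bit shift rather than a floating point multiply. The multiplier-and-shift (dyadic) rescaling is an assumption borrowed from the general integer-only inference literature, and all names and scale values here are hypothetical. The only floating point step is the offline conversion of the scale ratio, which would be done once before deployment.

```python
def int_dot(q_a, q_b):
    """Multiply INT8 values and accumulate in a wider (INT32-style) integer."""
    acc = 0
    for a, b in zip(q_a, q_b):
        acc += a * b
    return acc


def to_dyadic(real_multiplier, shift=16):
    """Offline step: approximate a real multiplier by m / 2**shift."""
    return round(real_multiplier * (1 << shift)), shift


def requantize(acc, m, shift, num_bits=8):
    """Rescale the accumulator to INT8 using only integer multiply and shift."""
    q_max = 2 ** (num_bits - 1) - 1
    rounded = (acc * m + (1 << (shift - 1))) >> shift   # round to nearest
    return max(-q_max, min(q_max, rounded))


# Hypothetical scales for the two inputs and the output activation.
scale_a, scale_b, scale_out = 0.02, 0.01, 0.05
q_a, q_b = [10, -20, 30], [5, 5, -5]

acc = int_dot(q_a, q_b)                               # integer-only at inference
m, shift = to_dyadic(scale_a * scale_b / scale_out)   # computed once, offline
q_out = requantize(acc, m, shift)
```

Here the accumulator represents the real value `acc * scale_a * scale_b`, and `q_out * scale_out` approximates the same value on the coarser output grid.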

For instance, the recent server class of Turing Tensor Cores have added high throughput integer logic that is faster than single/half-precision. Similarly, some of the edge processor cores in the ARM Cortex-M (ARM, 2020) family for embedded systems only contain integer arithmetic units, and they can only support NN deployment with the integer-only kernels (Lai et al., 2018). Moreover, one has to consider that compared to the integer-only inference, the approaches that use floating point arithmetic are inferior in latency and power efficiency. For chip designers wishing to support BERT-like models, adding floating point arithmetic logic occupies larger die area on a chip, as compared to integer arithmetic logic.

Thus, the complete removal of floating point arithmetic for inference could have a major impact on designing applications, software, and hardware for efficient inference at the edge (ARM, 2020).

While prior work has shown the feasibility of integer-only inference (Jacob et al., 2018; Yao et al., 2020), these approaches have only focused on models in computer vision with simple CNN layers, Batch Normalization (BatchNorm) (Ioffe & Szegedy, 2015), and ReLU activations. These are all linear or piece-wise linear operators. Due to the non-linear operations used in Transformer architecture, e.g., GELU, Softmax, and Layer Normalization (LayerNorm), these methods cannot be applied to Transformer based models. Unlike ReLU, computing GELU and Softmax with integer-only arithmetic is not straightforward, due to their non-linearity.
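The contrast between ReLU and GELU can be illustrated with a small numerical check. Because ReLU is piece-wise linear, applying it to the quantized integer and keeping the same scale reproduces the real-valued result exactly; no analogous shortcut holds for GELU. This is only an illustrative sketch: the scale value and helper names are hypothetical, and floating point is used here solely to measure the discrepancy.

```python
import math


def relu(x):
    return max(x, 0.0)


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def int_relu(q):
    """ReLU applied directly to the quantized integer; the scale is unchanged."""
    return max(q, 0)


scale = 0.05                      # hypothetical quantization scale
int8_range = range(-127, 128)

# ReLU commutes with the scale, so the integer version introduces no error.
relu_gap = max(abs(relu(scale * q) - scale * int_relu(q)) for q in int8_range)

# GELU does not: applying it to q and rescaling gives a different function.
gelu_gap = max(abs(gelu(scale * q) - scale * gelu(q)) for q in int8_range)
```

With these values `relu_gap` is exactly zero while `gelu_gap` is clearly non-zero, which is why GELU needs a dedicated integer-only approximation.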

Furthermore, unlike BatchNorm whose parameters/statistics can be fused into the previous convolutional layer in inference, LayerNorm requires the dynamic computation of the square root of the variance for each input. This cannot be naïvely computed with integer-only arithmetic. Another challenge is that processing GELU, Softmax, and LayerNorm with low precision can result in significant accuracy degradation (Bhandare et al., 2019; Zafrir et al., 2019). For these reasons, other quantization methods such as (Bhandare et al., 2019; Shen et al., 2020; Zafrir et al., 2019) keep these operations in FP32.

In this work, we propose I-BERT to address these challenges with a series of novel integer-only quantization schemes for Transformer based models. Specifically, our contributions are:

• We propose new kernels for the efficient and accurate integer-only computation of GELU and Softmax. In particular, we approximate GELU and Softmax with light-weight second-order polynomials, which can be evaluated with integer-only arithmetic. We utilize different techniques to improve the approximation error, and achieve a maximum error on the order of $10^{-2}$ for GELU and $10^{-3}$ for Softmax. See the respective sections for details.

• For LayerNorm, we perform integer-only computation by leveraging a known algorithm for integer calculation of square root (Crandall & Pomerance, 2006). See the respective section for details.

• We use these approximations of GELU, Softmax, and LayerNorm to design integer-only quantization for Transformer based models. Specifically, we process Embedding and matrix multiplication (MatMul) with INT8 multiplication and INT32 accumulation. The following non-linear operations (GELU, Softmax, and LayerNorm) are then calculated on the INT32 accumulated result and then requantized back to INT8. We represent all parameters and activations in the entire computational graph with integers, and we never cast them into floating point. See Fig. 1 (right) for a schematic description.

• We apply I-BERT to RoBERTa-Base/Large, and we evaluate their accuracy on the GLUE (Wang et al., 2018) downstream tasks. I-BERT achieves similar results as compared to the full-precision baseline. Specifically, I-BERT outperforms the baseline on the GLUE downstream tasks for both RoBERTa-Base and RoBERTa-Large. See Tab. 2 for details.

• We deploy INT8 BERT models with the integer-only kernels for non-linear operations on a T4 GPU using TensorRT (NVIDIA, 2018). We show that INT8 inference achieves up to 4× speedup as compared to FP32 inference. See Tab. 3 for details.

2. Related Work

Efficient Neural Network. There are several different approaches to reduce the memory footprint, latency, and power of modern NN architectures.
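As an aside on the LayerNorm contribution listed in the introduction, the integer calculation of a square root can be sketched with the standard Newton iteration on integers. This is a generic textbook variant written for illustration, not necessarily the exact procedure used in I-BERT; the function name and the example activations are hypothetical.

```python
def int_sqrt(n):
    """Return floor(sqrt(n)) for a non-negative integer using only integer ops."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return 0
    # Initial guess: a power of two that is guaranteed to be >= sqrt(n).
    x = 1 << ((n.bit_length() + 1) // 2)
    while True:
        y = (x + n // x) >> 1          # Newton step with floor division
        if y >= x:                     # no further decrease: x is the answer
            return x
        x = y


# Integer-only standard deviation of some hypothetical integer activations.
activations = [12, -7, 30, 5]
mean = sum(activations) // len(activations)
variance = sum((a - mean) ** 2 for a in activations) // len(activations)
std = int_sqrt(variance)
```

The iterates decrease monotonically from the initial guess, so the loop stops at the largest integer whose square does not exceed `n`.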
