Transcription of I-BERT: Integer-only BERT Quantization - arXiv
I-BERT: Integer-only BERT Quantization

Sehoon Kim* 1, Amir Gholami* 1, Zhewei Yao* 1, Michael W. Mahoney 1, Kurt Keutzer 1

Abstract

Transformer based models, like BERT and RoBERTa, have achieved state-of-the-art results in many Natural Language Processing tasks. However, their memory footprint, inference latency, and power consumption are prohibitive for efficient inference at the edge, and even at the data center. While quantization can be a viable solution for this, previous work on quantizing Transformer based models uses floating-point arithmetic during inference, which cannot efficiently utilize integer-only logical units such as the recent Turing Tensor Cores, or traditional integer-only ARM processors. In this work, we propose I-BERT, a novel quantization scheme for Transformer based models that quantizes the entire inference with integer-only arithmetic.
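The contrast the abstract draws, between quantized models that still run floating-point arithmetic at inference time and a fully integer-only pipeline, can be made concrete with a small sketch. The code below is illustrative only and is not the I-BERT implementation: the symmetric 8-bit quantizer, the helper names (`quantize`, `simulated_dot`, `integer_dot`), and the scale values are all assumptions chosen for the example. It computes one dot product two ways: by dequantizing to floats first, and by accumulating purely in integers.

```python
def quantize(values, scale, bits=8):
    """Symmetric uniform quantization (assumed scheme): floats -> signed ints."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def simulated_dot(q_x, q_w, s_x, s_w):
    """Floating-point path: dequantize the integers, then do float math."""
    x = [q * s_x for q in q_x]
    w = [q * s_w for q in q_w]
    return sum(a * b for a, b in zip(x, w))


def integer_dot(q_x, q_w):
    """Integer-only path: multiply and accumulate integers, no floats."""
    return sum(a * b for a, b in zip(q_x, q_w))


# Hypothetical activations and weights, with hand-picked scales.
x = [0.5, -1.0, 0.25, 0.75]
w = [0.2, 0.4, -0.6, 0.1]
s_x = 1.0 / 127
s_w = 0.6 / 127

q_x = quantize(x, s_x)
q_w = quantize(w, s_w)

acc = integer_dot(q_x, q_w)                        # integer accumulator
float_result = simulated_dot(q_x, q_w, s_x, s_w)   # float arithmetic throughout
integer_result = acc * (s_x * s_w)                 # rescaled once, for comparison only

print(acc, float_result, integer_result)
```

Both paths represent the same quantized value. The difference is where the arithmetic happens: the first path needs floating-point multiply-accumulate units for every element, while the second keeps the inner loop on integer hardware and defers any rescaling to a single step at the end.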
quantization of BERT. However, to the best of our knowledge, all of the prior quantization work on Transformer based models uses simulated quantization (aka fake quantization), where all or part of the operations are performed with floating-point arithmetic. This requires the quantized parameters and/or activations