Introduction
Several papers have come out recently, such as LLM.int8() and QLoRA, showing how to run large language models with much less memory so they can be loaded and run on smaller devices. I wanted to better understand how these techniques work and also apply them to transformer vision models. In this blog post I’ll go through the different types of quantization schemes presented in LLM.int8(), along with simple numpy implementations, and at the end I’ll show how to use int8 and 4-bit quantization on the Segment-Anything-Model (SAM) backbone to reduce its memory consumption. If you haven’t already, I also highly recommend checking out Tim Dettmers’s blog on 8-bit matrix multiplication and 4-bit quantization, which served as the basis for writing this blog.
Quantization
Quantization in machine learning is a way to reduce latency and memory by representing numbers with fewer bits. I’m going to assume you are familiar with how floating point numbers are represented and get right into int8 quantization techniques. The main objective is to figure out how best to quantize numbers such that you lose the least amount of information. We’ll cover three ways to do this, the first being absmax quantization.
Absmax
Absmax quantization is fairly simple: you scale the input into the range [-127, 127] by dividing by the maximum absolute value and then multiplying by 127. Here’s a small python script to compute it.
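A minimal numpy version might look like this (the function and variable names here are my own):

```python
import numpy as np

def quantize_absmax(X):
    # scale factor maps the largest-magnitude value to 127
    s_x = 127 / np.max(np.abs(X))
    X_q = np.round(s_x * X).astype(np.int8)
    return X_q, s_x
```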
Here s_x is the scaling factor, 127 / max(|X|), which we multiply X by to move it into the range of -127 to 127.
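For example, with a made-up input whose largest magnitude is 4 (any such array gives the same scaling factor):

```python
X = np.array([0.5, -1.0, 2.0, 4.0])
X_q, s_x = quantize_absmax(X)
print(X_q)   # [ 16 -32  64 127]
print(s_x)   # 31.75
```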
You can see above that the largest number, 4, went to 127, and the scaling factor used was 31.75, since 4 * 31.75 = 127. You can dequantize this with the following python code.
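A matching dequantize sketch:

```python
def dequantize_absmax(X_q, s_x):
    # undo the scaling; the rounding error from quantization remains
    return X_q / s_x
```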
Dequantizing the above array we get:
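```python
# continuing the made-up example from above
print(dequantize_absmax(X_q, s_x))
# [ 0.50393701 -1.00787402  2.01574803  4.        ]
```

The values are close to, but not exactly, the originals; that difference is the quantization error.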
But what happens if we only have positive values? We end up only using half of the quantization range:
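```python
# a made-up, strictly positive input
X_pos = np.array([0.5, 1.0, 2.0, 4.0])
X_q, s_x = quantize_absmax(X_pos)
print(X_q)   # [ 16  32  64 127]  -- nothing lands in [-128, 0)
```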
This wastes half of the available precision and can lead to larger quantization errors, which brings us to zeropoint quantization.
Zeropoint Quantization
Zeropoint quantization solves this issue by scaling and then shifting the numbers. First we scale the input by the normalized dynamic range nd, then we shift the result by the zeropoint zp, which we can write as:
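$$X_q = \mathrm{round}(nd_X \cdot X) + zp_X$$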
So we are rescaling to a new range, the size of which is 255:
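$$nd_X = \frac{255}{\max(X) - \min(X)}$$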
And then we take the minimum value, mapped into this new range, offset it by 128 (half the size of our new range), and use that as the zeropoint to shift the range so it is centered on 0:
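$$zp_X = -\left(\mathrm{round}(nd_X \cdot \min(X)) + 128\right)$$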
Putting it all together in python, with a few other checks, we have the following.
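A minimal sketch of what this might look like (the exact extra checks in the original may differ; here I just guard against a zero range):

```python
def quantize_zeropoint(X):
    x_min, x_max = np.min(X), np.max(X)
    x_range = x_max - x_min
    # guard against a constant input, which would give a zero range
    if x_range == 0:
        x_range = 1.0
    nd_x = 255.0 / x_range                      # normalized dynamic range
    zp_x = -(np.round(nd_x * x_min) + 128)      # zeropoint
    X_q = np.round(nd_x * X) + zp_x
    # clip to the int8 range to be safe, then store as int8
    X_q = np.clip(X_q, -128, 127).astype(np.int8)
    return X_q, nd_x, zp_x
```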
And to dequantize we simply subtract the zeropoint and divide by the scale.
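Again as a small sketch:

```python
def dequantize_zeropoint(X_q, nd_x, zp_x):
    # undo the shift, then undo the scaling
    return (X_q - zp_x) / nd_x
```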
To get even better quantization results, you can also apply either absmax or zeropoint quantization per row or per column of a matrix, which helps deal with more variability in the input. You can find a good overview of zeropoint quantization (also called affine quantization) here in section 3.1.1. It turns out this still isn’t enough to get quantization to work well for larger models, as important outlier features can lead to large quantization errors.
LLM.int8() Quantization
Tim Dettmers was able to solve this issue in his LLM.int8() paper by introducing a hybrid approach. In the paper he notes that outlier features with large magnitudes start to emerge in larger transformers and have a strong effect on attention and prediction performance. To preserve these features we simply extract the outlier features from the input X and multiply those in float16, while we quantize the rest to int8. Here we assume outlier features have magnitude 2 or more.
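A sketch of that split in numpy (treating a feature column as an outlier if any of its values reach the threshold is my reading of the paper; the threshold of 2 follows the text above, while the paper's default is larger):

```python
def split_outliers(X, threshold=2.0):
    # feature columns where any value reaches the threshold are treated as outliers
    outlier_cols = np.any(np.abs(X) >= threshold, axis=0)
    X_outliers = X[:, outlier_cols]   # stays in fp16
    X_rest = X[:, ~outlier_cols]      # gets quantized to int8
    return X_rest, X_outliers, outlier_cols
```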
Once you have the two separate sets of matrices, you can use whatever int8 implementation you want, absmax or zeropoint; the outlier matrices are multiplied in fp16 and the two results are added together.
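A rough numpy sketch of that mixed-precision matmul, building on the helpers above (here I use absmax scales per row of X and per column of W; the real bitsandbytes kernels are of course far more optimized):

```python
def quantize_absmax_rowwise(X):
    # one absmax scale per row (assumes no all-zero rows)
    s = 127 / np.max(np.abs(X), axis=1, keepdims=True)
    return np.round(s * X).astype(np.int8), s

def mixed_precision_matmul(X, W, threshold=2.0):
    # split the input features (columns of X, rows of W) into outliers and the rest
    X_rest, X_out, outlier_cols = split_outliers(X, threshold)
    W_rest, W_out = W[~outlier_cols, :], W[outlier_cols, :]

    # int8 path: quantize, matmul in int32 accumulators, then dequantize
    Xq, s_x = quantize_absmax_rowwise(X_rest)        # per-row scales of X
    Wq, s_w = quantize_absmax_rowwise(W_rest.T)      # per-column scales of W
    out_int8 = (Xq.astype(np.int32) @ Wq.T.astype(np.int32)) / (s_x * s_w.T)

    # fp16 path for the outlier features
    out_fp16 = (X_out.astype(np.float16) @ W_out.astype(np.float16)).astype(np.float32)

    return out_int8 + out_fp16
```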
Here the row_wise and col_wise quantize functions can be either absmax or zeropoint, applied per row or per column as described above. I also recommend checking out Tim Dettmers’s blog, which has a great animation of the above computation. You can find all the code for the above implementations here.
Using INT8/NF4 For Vision Models
While the blog shows how easy it is to apply int8 quantization to language models from huggingface, I found it difficult to apply it to other models such as the SAM backbone. To run the following code you’ll first want to install bitsandbytes following the instructions from their github page. Then make sure you install the latest versions of these libraries from github:
pip install --upgrade git+https://github.com/huggingface/transformers
pip install --upgrade git+https://github.com/huggingface/accelerate
You’ll need to build the model in fp16 as the int8 module only works with fp16 input for now. Then call replace_with_bnb_layer, which will replace all linear layers with the 8-bit linear layer (or the 4-bit layer if you choose to use that instead). Before the call we have typical Linear layers, and after the call they turn into Linear8bitLt layers.
If we print out one of the layers, model.blocks[0].attn.qkv, it prints
Parameter (Int8Params(..., device='meta', size=(3840,1280)))
which is the int8 parameters, but they are still empty and sitting on the meta device. To fill in the weights we need to call set_module_quantized_tensor_to_device, after which printing the layer again shows the actual quantized weights.
Finally, we can call the model by passing it a half-precision input. All the steps together look like the following.
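Here’s a rough sketch of those steps for the SAM image encoder (the import paths, the use of accelerate’s init_empty_weights, and the checkpoint handling are my assumptions and may need adjusting for your library versions):

```python
import torch
from accelerate import init_empty_weights
from segment_anything import sam_model_registry

# NOTE: these import paths are an assumption and may differ between transformers versions
from transformers.utils.bitsandbytes import (
    replace_with_bnb_layer,
    set_module_quantized_tensor_to_device,
)

# 1. Build the SAM backbone in fp16 with empty (meta-device) weights
with init_empty_weights():
    sam = sam_model_registry["vit_h"](checkpoint=None)
model = sam.image_encoder.half()

# 2. Swap every Linear layer for the bitsandbytes 8-bit (or 4-bit) linear layer
model = replace_with_bnb_layer(model)

# 3. Fill in the empty Int8Params with the real checkpoint weights
state_dict = torch.load("sam_checkpoint.pth")  # path to your SAM checkpoint
for name, value in state_dict.items():
    if name.startswith("image_encoder."):
        set_module_quantized_tensor_to_device(
            model, name.replace("image_encoder.", ""), "cuda", value=value
        )

# 4. Call the model with a half-precision input
image = torch.randn(1, 3, 1024, 1024, dtype=torch.float16, device="cuda")
with torch.no_grad():
    features = model(image)
```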
To get this working with the SAM model you must ensure the Linear layers it replaces are doing matrix multiplication on 2d matrices. I’ve done this in the repository here, as well as added all the quantization functions above so you can play around with them. Below are some latency and memory numbers on an RTX A5000 (you can find the 4-bit conversion in the repository code):
You can see the max allocated memory drops by about 1GB going from 32-bit to 8-bit, and you get a little over half the latency. 8-bit is about 1.5x slower than 16-bit, but the authors are working to bring that down. The relative decrease in max allocated memory isn’t that large compared to 32-bit or 16-bit, but if we calculate the actual model size we get a clearer picture:
This shows about a 75% reduction from 32bit to 8bit and then halving it again when we go to 4 bit!
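If you want to compute the model size yourself, a quick approximation is to sum the bytes of every parameter and buffer (this ignores any extra quantization state kept outside the parameters):

```python
def model_size_mb(model):
    # int8 parameters report element_size() == 1, fp16 == 2, fp32 == 4
    param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    buffer_bytes = sum(b.numel() * b.element_size() for b in model.buffers())
    return (param_bytes + buffer_bytes) / 1024 ** 2
```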
Conclusion
To recap, we’ve covered two basic types of quantization, absmax and zeropoint, along with the LLM.int8() hybrid scheme that keeps outlier features in fp16. We’ve also shown how to quantize a vision transformer model, SAM, using the bitsandbytes library, giving us up to an 86% reduction in model size! All the quantization example code, as well as the code for quantizing and running the SAM backbone, is available here.