How Quantization Works & Quantizing SAM

Dillon Laird

Introduction

Several papers have come out recently, such as LLM.int8() and QLoRA, showing how to run large language models with much less memory so they can be loaded and run on smaller devices. I wanted to better understand how these techniques work and also apply them to transformer vision models. In this blog post I'll go through the different types of quantization schemes presented in LLM.int8(), along with simple numpy implementations, and at the end I show how to use int8 and 4-bit quantization on the Segment-Anything-Model (SAM) backbone to reduce its memory consumption. If you haven't already, I also highly recommend checking out Tim Dettmers's blog on 8-bit matrix multiplication and 4-bit quantization, which served as the basis for this post.

Quantization

Quantization in machine learning is a way to reduce latency and memory by representing numbers with fewer bits. I'm going to assume you are familiar with how floating point numbers are represented and get right into int8 quantization techniques. The main objective is to figure out how best to quantize numbers so that you lose the least amount of information. We'll cover three ways to do this, the first being absmax quantization.

Absmax

Absmax quantization is fairly simple: you scale the input into the range [-127, 127] by dividing by the maximum absolute value and multiplying by 127. Here's a small python script to compute it:
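A minimal numpy sketch (the exact function and variable names here are just illustrative):

import numpy as np

def absmax_quantize(X):
    # scaling factor: map the largest magnitude in X onto 127
    s_x = 127 / np.max(np.abs(X))
    # scale and round to the nearest integer in [-127, 127]
    X_q = np.round(s_x * X).astype(np.int8)
    return X_q, s_x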

 

Here s_x is the scaling factor, i.e. 127 / max(|X|), which we multiply by X to move it into the range -127 to 127. For example:
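Quantizing a small illustrative array whose largest entry is 4:

X = np.array([-3., -1., 2., 4.])
X_q, s_x = absmax_quantize(X)
print(X_q)   # [-95 -32  64 127]
print(s_x)   # 31.75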

 

You can see above that the largest number, 4, went to 127, and the scaling factor used was 31.75, where 4 * 31.75 = 127. You can dequantize this with the following python code:
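Again as a sketch, dequantization just divides by the same scaling factor:

def absmax_dequantize(X_q, s_x):
    # undo the scaling; the result only approximately recovers the input
    return X_q / s_x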

 

Dequantizing the above array we get:
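Using the illustrative array from above:

print(absmax_dequantize(X_q, s_x))
# approximately [-2.992 -1.008  2.016  4.   ]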

 

But what happens if we only have positive values? We end up only using half of the quantization range:
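For example, with an all-positive (again illustrative) input, none of the quantized values land in the negative half of [-127, 127]:

X_pos = np.array([1., 2., 3., 4.])
X_q, s_x = absmax_quantize(X_pos)
print(X_q)   # [ 32  64  95 127] -- the range [-127, 0) is never used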

 

This wastes precision and can lead to larger quantization errors, which brings us to zeropoint quantization.

Zeropoint Quantization

Zeropoint quantization solves this issue by scaling then shifting the numbers. First we scale the input by the normalized dynamic range nd:
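which in numpy-style notation (with nd defined just below) is roughly:

X_scaled = nd * X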

 

Then we shift by the zeropoint zp:
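so the quantized values are roughly:

X_q = np.round(X_scaled) + zp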

 

So we are rescaling to a new range, the size of which is 255:
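With an int8 target range of size 255 (roughly [-128, 127]), the scale works out to:

nd = 255 / (np.max(X) - np.min(X))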

 

And then we move the minimum value into this new range, offsetting it by 128 (roughly half the size of our new range) so that the quantized range is shifted to be centered over 0:
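which, assuming we want min(X) to land at -128, gives:

zp = -np.round(np.min(X) * nd) - 128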

 

Putting it all together in python with a few other checks we have:
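Here is a sketch of the full routine (names illustrative; the main extra checks are a guard against a constant input and clipping into the int8 range):

def zeropoint_quantize(X):
    # dynamic range of the input; guard against a constant tensor
    rng = np.max(X) - np.min(X)
    rng = rng if rng != 0 else 1.0
    # scale factor: map the dynamic range onto a range of size 255
    nd = 255 / rng
    # zeropoint: shift so that min(X) lands at -128
    zp = -np.round(np.min(X) * nd) - 128
    # scale, shift, round, and clip into the int8 range
    X_q = np.clip(np.round(nd * X) + zp, -128, 127).astype(np.int8)
    return X_q, nd, zp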

 

 

And to dequantize we simply subtract the zero point and divide by the scale:
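Continuing the sketch:

def zeropoint_dequantize(X_q, nd, zp):
    # undo the shift, then undo the scaling
    return (X_q - zp) / nd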

 

 

To get even better quantization results, you can also apply either absmax or zeropoint quantization per row or column of a matrix. This helps deal with more variability in the input. You can find a good overview of zeropoint quantization (also called affine quantization) here in section 3.1.1. It turns out this still isn't enough to get quantization to work well for larger models, as important outlier features can lead to large quantization errors.

LLM.int8() Quantization

Tim Dettmers was able to solve this issue in his LLM.int8() paper by introducing a hybrid approach. In the paper he notes that outlier features with large magnitudes start to emerge in larger transformers and have a strong effect on attention and prediction performance. To preserve these features, we simply extract the outlier features from the input X and multiply those in float16 while we quantize the rest to int8. Here we assume outlier features have magnitude 2 or more:
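A sketch of the split (the threshold of 2 follows the text above; the paper itself uses a larger threshold of 6.0). A column of X, i.e. a feature dimension, counts as an outlier if any of its entries reaches the threshold:

def split_outliers(X, W, threshold=2.0):
    # feature dimensions where any activation reaches the threshold stay in fp16
    outlier_cols = np.any(np.abs(X) >= threshold, axis=0)
    X_out, X_rest = X[:, outlier_cols], X[:, ~outlier_cols]
    # split the matching rows of the weight matrix W
    W_out, W_rest = W[outlier_cols, :], W[~outlier_cols, :]
    return (X_out, W_out), (X_rest, W_rest)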

 

 

Once you have the two separate sets of matrices, you can use whatever int8 implementation you want, absmax or zeropoint; the outlier matrices are multiplied in fp16 and the two results are added together:
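A rough numpy version of the mixed-precision matmul, using per-row / per-column absmax as the int8 scheme (zeropoint would work the same way); the int8 product is emulated here with int32 accumulation:

def row_wise_absmax_quantize(X):
    # one scaling factor per row
    s = 127 / np.max(np.abs(X), axis=1, keepdims=True)
    return np.round(s * X).astype(np.int8), s

def col_wise_absmax_quantize(W):
    # one scaling factor per column
    s = 127 / np.max(np.abs(W), axis=0, keepdims=True)
    return np.round(s * W).astype(np.int8), s

def mixed_precision_matmul(X, W, threshold=2.0):
    # assumes both the outlier and non-outlier parts are non-empty
    (X_out, W_out), (X_rest, W_rest) = split_outliers(X, W, threshold)
    # quantize the non-outlier part and multiply in int8 (int32 accumulation)
    X_q, s_x = row_wise_absmax_quantize(X_rest)
    W_q, s_w = col_wise_absmax_quantize(W_rest)
    rest = (X_q.astype(np.int32) @ W_q.astype(np.int32)) / (s_x * s_w)
    # multiply the outlier part in fp16 and add the two results together
    out = X_out.astype(np.float16) @ W_out.astype(np.float16)
    return rest + out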

 

Where the row_wise and col_wise quantize functions can be either absmax or zeropoint, applied per row or per column as described above. I also recommend checking out Tim Dettmers's blog, which has a great animation of the above computation. You can find all the code for the above implementations here.

Using INT8/NF4 For Vision Models

While the blog shows how easy it is to apply int8 quantization to language models from huggingface, I found it difficult to apply it to other models such as the SAM backbone. To run the following code you'll first want to install bitsandbytes following the instructions from their github page. Then make sure you install the latest versions of these libraries from github:


pip install --upgrade git+https://github.com/huggingface/transformers
pip install --upgrade git+https://github.com/huggingface/accelerate

You'll need to build the model in fp16, as the int8 module only works with fp16 input for now. Then call replace_with_bnb_layer, which will replace all Linear layers with the 8-bit linear layer (or the 4-bit layer if you choose to use that). You can see that before calling it, we have typical Linear layers:
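For example, printing the qkv projection of the first transformer block (the module path and sizes follow the printout shown further below) gives a plain Linear layer, roughly:

print(model.blocks[0].attn.qkv)
# Linear(in_features=1280, out_features=3840, bias=True)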

 

 

and after calling it they turn into Linear8bitLt layers:
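Printing the same layer again now shows something like:

print(model.blocks[0].attn.qkv)
# Linear8bitLt(in_features=1280, out_features=3840, bias=True)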

 

 

If we print out one of the layers, model.blocks[0].attn.qkv, it prints Parameter(Int8Params(..., device='meta', size=(3840, 1280))), which is the int8 parameter. But it's still empty and sits on the meta device. To fill in the weights we need to call set_module_quantized_tensor_to_device; after that, printing the layer again shows the quantized weights:
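A sketch of that step (the signature of set_module_quantized_tensor_to_device follows how transformers uses it internally, and the import path may differ between versions):

from transformers.utils.bitsandbytes import set_module_quantized_tensor_to_device

# copy the saved fp16 weights into the empty meta-device parameters;
# they are quantized to int8 as they are moved onto the GPU
for name, param in fp16_state_dict.items():
    set_module_quantized_tensor_to_device(model, name, "cuda:0", value=param)

print(model.blocks[0].attn.qkv.weight)
# Int8Params([[...], ...], device='cuda:0', dtype=torch.int8)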

 

 

Finally, we can call the model by passing it a half-precision input. All the steps together look like this:
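As a sketch (build_sam_vit_h is a placeholder for however you construct the fp16 SAM backbone, and the import paths and names below may differ between transformers versions; in some versions replace_with_bnb_layer is called replace_with_bnb_linear):

import torch
from transformers.utils.bitsandbytes import (
    replace_with_bnb_layer,
    set_module_quantized_tensor_to_device,
)

# 1. build the SAM image encoder in fp16 and keep its weights around
model = build_sam_vit_h().half()
fp16_state_dict = model.state_dict()

# 2. swap nn.Linear layers for 8-bit (or 4-bit) bitsandbytes layers;
#    the new weights start out empty on the meta device
model = replace_with_bnb_layer(model)

# 3. load the fp16 weights back in, quantizing them onto the GPU
for name, param in fp16_state_dict.items():
    set_module_quantized_tensor_to_device(model, name, "cuda:0", value=param)

# 4. run the backbone on a half-precision input (SAM expects 1024x1024 images)
x = torch.randn(1, 3, 1024, 1024, dtype=torch.float16, device="cuda:0")
with torch.no_grad():
    features = model(x)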

 

 

To get this working with the SAM model you must ensure the Linear layers it replaces are doing matrix multiplication on 2d matrices. I've done this in the repository here, as well as added all the quantization functions above, so you can play around with it. Below are some latency and memory numbers on an RTX A5000 (you can find the 4-bit conversion in the repository code):

 

 

You can see the max allocated memory drops by about 1GB going from 32-bit to 8-bit, and you get roughly half the latency. The 8-bit latency is still about 1.5x slower than 16-bit, but the authors are working to decrease that. The relative decrease in max allocated memory is not that large compared to 32-bit or 16-bit, but if we calculate the actual model size we get a clearer picture:

 

This shows about a 75% reduction from 32-bit to 8-bit, and the model size halves again when we go to 4-bit!

Conclusion

To recap, we've covered two basic types of quantization, absmax and zeropoint quantization. We've also shown how to quantize a vision transformer model, SAM, using the bitsandbytes library, giving us up to an 86% reduction in model size! All the quantization example code, as well as the code for quantizing and running the SAM backbone, is available here.
