Quantization
llms
September 1st, 2023: run on a Colab T4 GPU runtime
Overview¶
The goal of this notebook is to take a deeper look at quantization. In particular, I load the OPT-350m model under different quantization schemes and inspect 1) the size of the model and 2) the weight data types and values.
Summary¶
The code shows the following results:
- Full model: 1,325 MB of memory; the weights are stored as fp32.
- 8-bit: 359 MB (~3.7x smaller); the weights are stored as int8, with range -128 to 127.
- 4-bit: 208 MB (~6.4x smaller); the weights are stored as uint8, wrapped in a Params4bit class.
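The reduction factors follow directly from the measured footprints (a quick check, not part of the original notebook runs):

print(1325 / 359)   # ≈ 3.7x smaller for 8-bit
print(1325 / 208)   # ≈ 6.4x smaller for 4-bit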
Open Questions¶
- Why does the shape of the weights in the 4-bit Linear4bit layer change?
- While the parameter weights are stored like this, compute/inference occurs in the dtype specified by the bnb_4bit_compute_dtype parameter (see the sketch after this list). How does this impact speed-up?
- Why don't CPU devices support 8-bit cores?
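Regarding the compute dtype: a minimal sketch (not run in this notebook) of how it is set with BitsAndBytesConfig. The weights stay in 4-bit storage, while matrix multiplications are carried out in the chosen dtype:

import torch
from transformers import BitsAndBytesConfig

# 4-bit storage, matmuls performed in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)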
Notes¶
- 1 byte: 8 bits
- FP32: 32 bits, 4 bytes, full-precision
- Float16: 16 bits, 2 bytes, half-precision
- bfloat16: 16 bits, 2 bytes, 'brain float', half-precision
- int8: 8 bits, 1 byte, 256 values (range -128 to 127)
- uint8: 8 bits, 1 byte, 256 values (range 0 to 255)
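These widths and ranges can be confirmed directly in PyTorch (a quick aside, not part of the original runs):

import torch

print(torch.finfo(torch.float32))   # 32-bit, full-precision
print(torch.finfo(torch.float16))   # 16-bit, half-precision
print(torch.finfo(torch.bfloat16))  # 16-bit, 'brain float'
print(torch.iinfo(torch.int8))      # min=-128, max=127
print(torch.iinfo(torch.uint8))     # min=0, max=255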
Code¶
Imports¶
In [ ]:
!pip install -q -U transformers
!pip install -q -U bitsandbytes
!pip install -q -U accelerate
In [ ]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import BitsAndBytesConfig
Function to load the model, report its memory footprint, and return sample weights
In [ ]:
def get_model_info(bnb_config):
    model_id = "facebook/opt-350m"
    # Load OPT-350m with the given quantization config (None = full precision)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map='auto', quantization_config=bnb_config)
    #print(model)
    print(f"Model memory: {model.get_memory_footprint() / 1e6:,.0f} MB")
    # Inspect one weight matrix: the decoder's input projection
    layer = model.get_parameter('model.decoder.project_in.weight')
    print(f'Layer: {layer}')
    print(f"Model weight shape: {layer.shape}")
    print(f"Model weight dtype: {layer.dtype}")
    w = layer.detach().cpu().numpy()
    return w
Import Full Model¶
In [ ]:
wf = get_model_info(bnb_config=None)
wf[0][:5]
Model memory: 1,325 MB
Layer: Parameter containing:
tensor([[ 0.1122, -0.0844, -0.0203,  ...,  0.0970,  0.0074,  0.0431],
        [-0.0696, -0.0037, -0.0627,  ...,  0.0359, -0.0157,  0.0105],
        [-0.0268,  0.0077,  0.0630,  ..., -0.0326,  0.0033, -0.0459],
        ...,
        [ 0.0150, -0.0346, -0.0784,  ...,  0.0442,  0.0326,  0.0418],
        [ 0.0468, -0.0705,  0.0620,  ...,  0.0169,  0.0159,  0.0397],
        [-0.0149,  0.0487,  0.0774,  ...,  0.0274, -0.0091, -0.0626]],
       device='cuda:0', requires_grad=True)
Model weight shape: torch.Size([1024, 512])
Model weight dtype: torch.float32
Out[ ]:
array([ 0.11218262, -0.08435059, -0.02027893, 0.02336121, 0.01412964], dtype=float32)
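As a rough cross-check (an inference from the printed footprint, not something the notebook computes): at 4 bytes per fp32 weight, 1,325 MB implies roughly 331M parameters, the expected ballpark for a "350m" model.

print(1325e6 / 4 / 1e6)   # ≈ 331 million parameters implied by the fp32 footprint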
Import 8-bit¶
In [ ]:
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
w8b = get_model_info(bnb_config=bnb_config)
w8b[0][:5]
Model memory: 359 MB
Layer: Parameter containing:
Parameter(Int8Params([[114, -86, -21,  ...,  99,   8,  44],
            [-71,  -4, -64,  ...,  37, -16,  11],
            [-27,   8,  64,  ..., -33,   3, -47],
            ...,
            [ 15, -35, -80,  ...,  45,  33,  42],
            [ 47, -71,  63,  ...,  17,  16,  40],
            [-15,  50,  79,  ...,  28,  -9, -64]],
           device='cuda:0', dtype=torch.int8))
Model weight shape: torch.Size([1024, 512])
Model weight dtype: torch.int8
Out[ ]:
array([114, -86, -21, 24, 14], dtype=int8)
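The int8 values line up with the fp32 values from the full-precision run under row-wise absmax quantization, which is my reading of the LLM.int8() scheme (outlier handling aside). A small sketch using the arrays returned above:

import numpy as np

# Map the row's largest magnitude to 127, then round
row = wf[0]                          # first row of the full-precision weights
scale = 127.0 / np.abs(row).max()
print(np.round(row[:5] * scale))     # close to [114, -86, -21, 24, 14] above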
Import 4-bit & FP4¶
In [ ]:
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_use_double_quant=False, bnb_4bit_quant_type="fp4")
w4b = get_model_info(bnb_config=bnb_config)
Model memory: 208 MB
Layer: Parameter containing:
Parameter(Params4bit([[ 58],
            [230],
            [100],
            ...,
            [220],
            [ 39],
            [154]], device='cuda:0', dtype=torch.uint8))
Model weight shape: torch.Size([262144, 1])
Model weight dtype: torch.uint8
In [ ]:
w4b[:5]
Out[ ]:
array([[ 58],
       [230],
       [100],
       [236],
       [ 91]], dtype=uint8)
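A plausible answer to the shape question in Open Questions (my reading, not verified against the bitsandbytes source): each uint8 byte packs two 4-bit values, so the 1024 x 512 matrix collapses into a flat buffer of half as many bytes.

print(1024 * 512 // 2)   # 262144, matching the Params4bit shape [262144, 1]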
Import 4-bit & NF4¶
In [ ]:
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_use_double_quant=False, bnb_4bit_quant_type="nf4")
w4bnf4 = get_model_info(bnb_config=bnb_config)
Model memory: 208 MB
Layer: Parameter containing:
Parameter(Params4bit([[241],
            [ 90],
            [155],
            ...,
            [ 36],
            [234],
            [ 98]], device='cuda:0', dtype=torch.uint8))
Model weight shape: torch.Size([262144, 1])
Model weight dtype: torch.uint8
4-bit double quant¶
To enable nested quantization, use the bnb_4bit_use_double_quant argument in BitsAndBytesConfig. This applies a second quantization after the first one (quantizing the quantization constants), saving an additional 0.4 bits per parameter.
In [ ]:
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_use_double_quant=True)
w4b_dq = get_model_info(bnb_config=bnb_config)
Model memory: 208 MB
Layer: Parameter containing:
Parameter(Params4bit([[ 58],
            [230],
            [100],
            ...,
            [220],
            [ 39],
            [154]], device='cuda:0', dtype=torch.uint8))
Model weight shape: torch.Size([262144, 1])
Model weight dtype: torch.uint8