Quantization
llms · September 1st, 2023 · run on a Colab T4 GPU runtime
Overview¶
The goal of this notebook is to take a deeper look at quantization. In particular, I load the OPT-350m model under different quantization schemes and inspect 1) the size of the model and 2) the weight data types and values.
Summary¶
The code shows the following results:
- Full model: 1,325 MB of memory; the weights are stored as fp32.
- 8-bit: 359 MB, roughly 3.7x smaller; the weights are stored as int8, with range -128 to 127.
- 4-bit: 208 MB, roughly 6.4x smaller; the weights are stored as uint8 wrapped in a Params4bit class.
Open Questions¶
- Why does the shape of the weights in the 4-bit Linear4bit layer change? (See the packing sketch after the FP4 section below.)
- While the parameter weights are stored like this, compute/inference happens in the dtype specified by the bnb_4bit_compute_dtype parameter. How does this impact speedup?
- Why don't CPU devices support 8-bit tensor cores?
Notes¶
- 1 byte: 8 bits
- FP32: 32 bits, 4 bytes, full-precision
- Float16: 16 bits, 2 bytes, half-precision
- bfloat16: 16 bits, 2 bytes, 'brain float', half-precision
- int8: 8 bits, 1 byte, 256 values (range -128 to 127)
- uint8: 8 bits, 1 byte, 256 values (range 0 to 255)
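The bit widths and ranges listed above can be double-checked directly in PyTorch; a small verification snippet (not part of the original run):
In [ ]:
import torch

# Quick check of the bit widths and ranges listed above
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    print(dtype, torch.finfo(dtype).bits, "bits")
for dtype in (torch.int8, torch.uint8):
    info = torch.iinfo(dtype)
    print(dtype, info.bits, "bits, range", info.min, "to", info.max)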
Code¶
Imports¶
In [ ]:
!pip install -q -U transformers
!pip install -q -U bitsandbytes
!pip install -q -U accelerate
In [ ]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import BitsAndBytesConfig
Function to load the model, print its memory footprint and a sample weight tensor, and return the weights as a NumPy array
In [ ]:
def get_model_info(bnb_config):
  """Load facebook/opt-350m with the given quantization config, print its
  memory footprint and a sample weight tensor, and return that tensor as a
  NumPy array."""
  model_id = "facebook/opt-350m"
  model = AutoModelForCausalLM.from_pretrained(model_id, device_map='auto', quantization_config=bnb_config)
  # print(model)
  print(f"Model memory: {model.get_memory_footprint() / 1e6:,.0f} MB")
  # Inspect one representative weight matrix: the decoder's input projection
  layer = model.get_parameter('model.decoder.project_in.weight')
  print(f'Layer: {layer}')
  print(f"Model weight shape: {layer.shape}")
  print(f"Model weight dtype: {layer.dtype}")
  # Detach and move to CPU so the raw values can be inspected as a NumPy array
  w = layer.detach().cpu().numpy()
  return w
Import Full Model¶
In [ ]:
wf = get_model_info(bnb_config=None)
wf[0][:5]
Model memory: 1,325 MB
Layer: Parameter containing:
tensor([[ 0.1122, -0.0844, -0.0203,  ...,  0.0970,  0.0074,  0.0431],
        [-0.0696, -0.0037, -0.0627,  ...,  0.0359, -0.0157,  0.0105],
        [-0.0268,  0.0077,  0.0630,  ..., -0.0326,  0.0033, -0.0459],
        ...,
        [ 0.0150, -0.0346, -0.0784,  ...,  0.0442,  0.0326,  0.0418],
        [ 0.0468, -0.0705,  0.0620,  ...,  0.0169,  0.0159,  0.0397],
        [-0.0149,  0.0487,  0.0774,  ...,  0.0274, -0.0091, -0.0626]],
       device='cuda:0', requires_grad=True)
Model weight shape: torch.Size([1024, 512])
Model weight dtype: torch.float32
Out[ ]:
array([ 0.11218262, -0.08435059, -0.02027893,  0.02336121,  0.01412964],
      dtype=float32)
Import 8-bit¶
In [ ]:
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
w8b = get_model_info(bnb_config=bnb_config)
w8b[0][:5]
Model memory: 359 MB
Layer: Parameter containing:
Parameter(Int8Params([[114, -86, -21,  ...,  99,   8,  44],
            [-71,  -4, -64,  ...,  37, -16,  11],
            [-27,   8,  64,  ..., -33,   3, -47],
            ...,
            [ 15, -35, -80,  ...,  45,  33,  42],
            [ 47, -71,  63,  ...,  17,  16,  40],
            [-15,  50,  79,  ...,  28,  -9, -64]], device='cuda:0',
           dtype=torch.int8))
Model weight shape: torch.Size([1024, 512])
Model weight dtype: torch.int8
Out[ ]:
array([114, -86, -21, 24, 14], dtype=int8)
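The int8 sample tracks the fp32 sample closely (114 ≈ 0.1122 × ~1016, and so on). bitsandbytes' LLM.int8() uses vector-wise absmax quantization plus a mixed-precision outlier decomposition; below is a minimal absmax sketch on the first row of wf, just to see the mapping. The exact values may differ slightly from the Int8Params output above depending on how bitsandbytes chooses its scaling vectors.
In [ ]:
import numpy as np

# Rough absmax quantization sketch (illustrative only; LLM.int8() adds
# vector-wise scaling and outlier handling on top of this)
row = wf[0]                                 # fp32 weights of the first row
scale = 127 / np.max(np.abs(row))           # map the largest magnitude to 127
q = np.round(row * scale).astype(np.int8)   # quantize to int8
print(q[:5])                                # compare with w8b[0][:5] above
print(q[:5] / scale)                        # dequantized approximation of wf[0][:5]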
Import 4-bit & FP4¶
In [ ]:
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_use_double_quant=False, bnb_4bit_quant_type="fp4")
w4b = get_model_info(bnb_config=bnb_config)
Model memory: 208 MB
Layer: Parameter containing:
Parameter(Params4bit([[ 58],
            [230],
            [100],
            ...,
            [220],
            [ 39],
            [154]], device='cuda:0', dtype=torch.uint8))
Model weight shape: torch.Size([262144, 1])
Model weight dtype: torch.uint8
In [ ]:
w4b[:5]
Out[ ]:
array([[ 58],
       [230],
       [100],
       [236],
       [ 91]], dtype=uint8)
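This mechanically answers the shape question from the open questions above: 4-bit storage packs two weights into each uint8 byte, so the 1024 × 512 = 524,288 weights of project_in become 262,144 bytes with shape [262144, 1]. Each nibble is a 4-bit code that indexes the FP4 (or NF4) codebook; the scale factors and the code-to-float lookup live in the layer's quant_state and aren't reproduced here. A quick sketch of splitting the packed bytes into nibbles:
In [ ]:
# Split the first few packed bytes into their two 4-bit codes.
# (Which nibble maps to which weight, and the code -> float dequantization,
# are bitsandbytes implementation details.)
for b in w4b[:5, 0]:
    b = int(b)
    print(f"byte={b:3d} -> high nibble {b >> 4}, low nibble {b & 0x0F}")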
Import 4-bit & NF4¶
In [ ]:
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_use_double_quant=False, bnb_4bit_quant_type="nf4")
w4bnf4 = get_model_info(bnb_config=bnb_config)
Model memory: 208 MB
Layer: Parameter containing:
Parameter(Params4bit([[241],
            [ 90],
            [155],
            ...,
            [ 36],
            [234],
            [ 98]], device='cuda:0', dtype=torch.uint8))
Model weight shape: torch.Size([262144, 1])
Model weight dtype: torch.uint8
Import 4-bit & Double Quant¶
To enable nested quantization, use the bnb_4bit_use_double_quant argument in BitsAndBytesConfig. This applies a second quantization to the quantization constants from the first pass, saving an additional ~0.4 bits per parameter.
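Back-of-the-envelope: the fp32 footprint of 1,325 MB implies roughly 331M parameters, so ~0.4 bits per parameter works out to about 17 MB of extra savings. A rough calculation follows; note the footprint reported below is still 208 MB, presumably because get_memory_footprint doesn't count the quantization constants held in each layer's quant_state.
In [ ]:
# Rough expected saving from nested quantization (assumption: parameter count
# inferred from the fp32 footprint, 1,325 MB / 4 bytes ≈ 331M parameters)
n_params = 1_325e6 / 4
saving_mb = n_params * 0.4 / 8 / 1e6   # 0.4 bits/param -> bytes -> MB
print(f"Expected extra saving: ~{saving_mb:.0f} MB")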
In [ ]:
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_use_double_quant=True)
w4b_dq = get_model_info(bnb_config=bnb_config)
Model memory: 208 MB
Layer: Parameter containing:
Parameter(Params4bit([[ 58],
            [230],
            [100],
            ...,
            [220],
            [ 39],
            [154]], device='cuda:0', dtype=torch.uint8))
Model weight shape: torch.Size([262144, 1])
Model weight dtype: torch.uint8