Online Quantization
Online quantization lets you take a BF16/FP16 model and quantize its Linear and MoE weights to lower precision (such as FP8) at load time, without needing a pre-quantized checkpoint or calibration data. Weights are converted during model loading and activations are dynamically scaled during each forward pass.
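To make "dynamically scaled" concrete, here is a minimal sketch of per-tensor dynamic scaling in plain Python. It assumes the common convention that the scale maps the largest magnitude in the tensor onto 448, the largest finite value of the e4m3 ("fn") format. The helper names are hypothetical and rounding to real FP8 bit patterns is left out, so treat it as an illustration rather than vLLM's kernel code.

```python
FP8_E4M3_MAX = 448.0  # largest finite value of the e4m3 ("fn") format


def per_tensor_scale(values):
    """Return the scale that maps the largest magnitude onto FP8_E4M3_MAX."""
    amax = max(abs(v) for v in values)
    # Guard against an all-zero tensor, which would otherwise give a zero scale.
    return max(amax, 1e-12) / FP8_E4M3_MAX


def scale_and_clamp(values, scale):
    """Divide by the scale and clamp to the FP8 range (FP8 rounding omitted)."""
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]


# The scale is recomputed from the data on every call, which is what makes it dynamic.
activations = [0.5, -3.0, 1.25, 7.0]
scale = per_tensor_scale(activations)
scaled = scale_and_clamp(activations, scale)
print(scale)   # 0.015625
print(scaled)  # [32.0, -192.0, 80.0, 448.0]
```

Multiplying the stored values by the scale recovers the original magnitudes, up to the rounding error that a real FP8 cast would introduce.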
Quick Start
Pass a scheme name to the quantization parameter:
from vllm import LLM
# Per-tensor FP8 quantization (one scale per weight tensor)
llm = LLM("meta-llama/Llama-3.1-8B", quantization="fp8_per_tensor")
# Per-block FP8 quantization (128x128 block scaling for weights and 1x128 block scaling for activations)
llm = LLM("meta-llama/Llama-3.1-8B", quantization="fp8_per_block")
# MXFP8 quantization for weights and activations
llm = LLM("meta-llama/Llama-3.1-8B", quantization="mxfp8")
Or with the CLI:
vllm serve meta-llama/Llama-3.1-8B --quantization fp8_per_tensor
vllm serve meta-llama/Llama-3.1-8B --quantization fp8_per_block
vllm serve meta-llama/Llama-3.1-8B --quantization mxfp8
Supported Schemes
| Scheme | Weight recipe | Activation recipe | Notes |
|---|---|---|---|
| fp8_per_tensor | fp8_e4m3 data, fp32 per-tensor scale | fp8_e4m3 data, fp32 per-tensor scale | On some GPUs (Ada, Hopper), linear activations use per-token scaling for better performance |
| fp8_per_block | fp8_e4m3 data, fp32 per-128x128-block scale | fp8_e4m3 data, fp32 per-1x128-block scale | |
| mxfp8 | fp8_e4m3 data, e8m0 per-1x32-block scale | fp8_e4m3 data, e8m0 per-1x32-block scale | Requires SM 100+ (Blackwell or newer) for w8a8; other GPUs use a w8a16 fallback |
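The block sizes in the table determine how many scale values accompany a weight tensor. The sketch below counts them for a given weight shape. It assumes that a partial block at the edge of a tensor still gets its own scale (ceiling division), which is a guess about padding behavior; `WEIGHT_BLOCKS` and `num_weight_scales` are hypothetical names, not vLLM APIs.

```python
import math

# (block_rows, block_cols) for the weight recipe of each scheme;
# None means a single scale for the whole tensor.
WEIGHT_BLOCKS = {
    "fp8_per_tensor": None,
    "fp8_per_block": (128, 128),
    "mxfp8": (1, 32),
}


def num_weight_scales(scheme, rows, cols):
    """Count the scale values needed for a rows x cols weight tensor."""
    block = WEIGHT_BLOCKS[scheme]
    if block is None:
        return 1
    block_rows, block_cols = block
    # Assumption: edge blocks that are only partly filled still carry a scale.
    return math.ceil(rows / block_rows) * math.ceil(cols / block_cols)


for scheme in WEIGHT_BLOCKS:
    print(scheme, num_weight_scales(scheme, 4096, 4096))
# fp8_per_tensor 1
# fp8_per_block 1024
# mxfp8 524288
```

Finer blocks track local magnitude more closely at the cost of more scale metadata. The e8m0 scales used by mxfp8 are one byte each, whereas the fp32 scales of the other schemes are four bytes.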
Advanced Configuration
For fine-grained control, use a quantization_config dictionary.
Schema
quantization_config:
  linear:
    weight: <name>      # see QUANT_KEY_NAMES in vllm/config/quantization.py
    activation: <name>
  moe:
    weight: <name>
    activation: <name>
  ignore: [<layer-name-or-regex>, ...]
linear and moe each accept either a full {weight, activation} dict or a bare string. A string is resolved first against the --quantization shorthands (taking the slot for the matching layer kind), then against QUANT_KEY_NAMES as a weight name. Unset fields fall back to the defaults of the --quantization shorthand or, for already-quantized checkpoints, to whatever the checkpoint declares.
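The resolution order described above can be sketched in a few lines of Python. Everything here is a hypothetical stand-in: the `SHORTHANDS` table, the placeholder recipe names (`w_tensor`, `a_block`, and so on), and the `resolve` helper are invented for illustration, and the real tables in vLLM may be shaped differently. The sketch also assumes that a bare weight name leaves the activation at its default.

```python
# Hypothetical stand-ins for vLLM's internal tables; names are placeholders.
SHORTHANDS = {
    "fp8_per_tensor": {
        "linear": {"weight": "w_tensor", "activation": "a_tensor"},
        "moe": {"weight": "w_tensor", "activation": "a_tensor"},
    },
    "fp8_per_block": {
        "linear": {"weight": "w_block", "activation": "a_block"},
        "moe": {"weight": "w_block", "activation": "a_block"},
    },
}
QUANT_KEY_NAMES = {"w_tensor", "w_block", "fp8_per_block_static"}


def resolve(kind, spec, defaults):
    """Resolve one `linear` or `moe` entry into a {weight, activation} dict."""
    if spec is None:
        return dict(defaults)  # nothing set: keep the defaults
    if isinstance(spec, str):
        if spec in SHORTHANDS:  # 1) shorthand: take the slot for this layer kind
            return dict(SHORTHANDS[spec][kind])
        if spec in QUANT_KEY_NAMES:  # 2) otherwise treat it as a weight name
            return {**defaults, "weight": spec}
        raise ValueError(f"unknown quantization name: {spec!r}")
    return {**defaults, **spec}  # dict: unset fields fall back to the defaults


base = SHORTHANDS["fp8_per_tensor"]
config = {"linear": "fp8_per_block", "moe": {"activation": "a_block"}}
resolved = {k: resolve(k, config.get(k), base[k]) for k in ("linear", "moe")}
print(resolved["linear"])  # {'weight': 'w_block', 'activation': 'a_block'}
print(resolved["moe"])     # {'weight': 'w_tensor', 'activation': 'a_block'}
```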
The CLI accepts the same shape as JSON or as dotted keys:
vllm serve <model> --quantization-config '{"moe":{"activation":"mxfp8"}}'
vllm serve <model> --quantization-config.moe.activation mxfp8
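The two CLI forms describe the same nested structure. The following sketch shows the equivalence by expanding dotted keys into a nested dict and comparing the result with the parsed JSON. `dotted_to_nested` is a hypothetical helper written for this illustration and is not how vLLM's argument parser is implemented.

```python
import json


def dotted_to_nested(pairs):
    """Expand ("a.b.c", value) pairs into a nested dict."""
    nested = {}
    for dotted_key, value in pairs:
        *parents, leaf = dotted_key.split(".")
        node = nested
        for part in parents:
            # Create intermediate dicts on demand.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


from_dotted = dotted_to_nested([("moe.activation", "mxfp8")])
from_json = json.loads('{"moe":{"activation":"mxfp8"}}')
print(from_dotted == from_json)  # True
```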
Activation overrides on already-quantized checkpoints
For checkpoint-quantized models, quantization_config lets you pick an activation format independently of the baked-in weights. The supported overrides are checkpoint-specific. Today this is wired up for MXFP4 MoE checkpoints (gpt-oss), where you can opt into FP8 activations.
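As a hedged illustration, the sketch below assembles such an override as a command line. The model id `openai/gpt-oss-20b` and the activation name `mxfp8` (borrowed from the CLI example earlier on this page) are assumptions, so check which activation names your vLLM version accepts for this checkpoint before relying on it.

```python
import json
import shlex

# Assumed values for illustration only.
model = "openai/gpt-oss-20b"
override = {"moe": {"activation": "mxfp8"}}

command = [
    "vllm",
    "serve",
    model,
    "--quantization-config",
    # Compact JSON, matching the shape the CLI accepts.
    json.dumps(override, separators=(",", ":")),
]

# shlex.join adds the shell quoting that the JSON argument needs.
print(shlex.join(command))
```

Only the MoE activation is set here, so the weights stay in whatever format the checkpoint declares.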
Combine with --moe-backend to pin a specific kernel family.
Separate Schemes for Dense and MoE Layers
You can apply different quantization schemes to dense linear layers and MoE expert layers via the linear and moe fields. Each accepts either a full spec dict, or a bare string naming an online shorthand (e.g. "fp8_per_block") or weight format (e.g. "fp8_per_block_static"); fields not set fall back to the shorthand defaults.
from vllm import LLM
# Linear: per-block FP8; MoE: per-tensor FP8 (inherited from the shorthand)
llm = LLM(
    "ibm-granite/granite-3.0-1b-a400m-base",
    quantization="fp8_per_tensor",
    quantization_config={
        "linear": "fp8_per_block",
    },
)
Or, to override only the MoE layers:
from vllm import LLM
# Linear: per-tensor FP8 (inherited); MoE: per-block FP8
llm = LLM(
    "ibm-granite/granite-3.0-1b-a400m-base",
    quantization="fp8_per_tensor",
    quantization_config={
        "moe": "fp8_per_block",
    },
)
Excluding Layers from Quantization
Use the ignore parameter to skip specific layers. It accepts exact layer names and regex patterns (prefixed with re:):
from vllm import LLM
llm = LLM(
    "ibm-granite/granite-3.0-1b-a400m-base",
    quantization="fp8_per_tensor",
    quantization_config={
        "ignore": [
            # exact layer name
            "model.layers.1.self_attn.o_proj",
            # regex: skip all QKV projections
            "re:.*[qkv]_proj",
        ],
    },
)
Note
For fused layers (e.g., qkv_proj which fuses q_proj, k_proj, v_proj), the ignore pattern must match the unfused shard names (q_proj, k_proj, v_proj), not the fused name.
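The matching rules, including the fused-layer caveat, can be sketched as follows. This is an illustration under stated assumptions, not vLLM's implementation: regex entries are assumed to be applied with `re.match` (anchored at the start of the name), a fused layer is assumed to be skipped only when all of its shards match, and `FUSED_SHARDS`, `matches`, and `is_ignored` are hypothetical names.

```python
import re

# Assumed mapping from a fused layer suffix to its unfused shard suffixes.
FUSED_SHARDS = {"qkv_proj": ["q_proj", "k_proj", "v_proj"]}


def matches(name, ignore):
    """True if `name` equals an exact entry or matches a `re:` entry."""
    for entry in ignore:
        if entry.startswith("re:"):
            if re.match(entry[3:], name):
                return True
        elif entry == name:
            return True
    return False


def is_ignored(layer_name, ignore):
    """Decide whether a (possibly fused) layer is excluded from quantization."""
    prefix, _, suffix = layer_name.rpartition(".")
    shards = FUSED_SHARDS.get(suffix)
    if shards is None:
        return matches(layer_name, ignore)
    # Fused layer: test the unfused shard names, never the fused name itself.
    names = [f"{prefix}.{s}" if prefix else s for s in shards]
    return all(matches(n, ignore) for n in names)


ignore = ["model.layers.1.self_attn.o_proj", "re:.*[qkv]_proj"]
print(is_ignored("model.layers.0.self_attn.qkv_proj", ignore))  # True
print(is_ignored("model.layers.1.self_attn.o_proj", ignore))    # True
print(is_ignored("model.layers.0.self_attn.o_proj", ignore))    # False
# Listing only the fused name has no effect under these assumptions.
print(is_ignored("model.layers.0.self_attn.qkv_proj",
                 ["model.layers.0.self_attn.qkv_proj"]))        # False
```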