vllm.config.kernel ¶
IrOpPriorityConfig ¶
Configuration for vLLM IR op priority for dispatching/lowering during the forward pass. Each member is a list of strings, which will be installed in worker init via vllm.ir.ops.
If specified manually, platform defaults will be appended to the lists. See KernelConfig.set_platform_defaults().
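As a rough illustration of the appending behavior described above, the following minimal sketch merges a user-specified priority list with platform defaults. The helper `append_platform_defaults`, the implementation names, and the choice to skip defaults that the user already listed are all assumptions made for illustration; they are not taken from vLLM.

```python
def append_platform_defaults(user: list[str], platform: list[str]) -> list[str]:
    """Return the user entries first, followed by platform defaults.

    Assumption: a default the user already listed is not appended again.
    """
    merged = list(user)
    for name in platform:
        if name not in merged:
            merged.append(name)
    return merged


# Hypothetical implementation names, used only for illustration.
user_priority = ["custom_kernel"]
platform_defaults = ["vendor_kernel", "native"]
merged = append_platform_defaults(user_priority, platform_defaults)
print(merged)
```

User entries keep their position at the front of the list, so they are tried before any platform default.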
Source code in vllm/config/kernel.py
fused_add_rms_norm class-attribute instance-attribute ¶
Priority list for vllm.ir.ops.fused_add_rms_norm
rms_norm class-attribute instance-attribute ¶
Priority list for vllm.ir.ops.rms_norm
_iter_op_priorities ¶
Yield (IrOp, priority_list) for each field, after importing platform kernels and validating each entry.
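The iterate-and-validate pattern can be sketched with a plain dataclass. This is a simplified stand-in, not the vLLM implementation: `KNOWN_IMPLS`, `PriorityIterSketch`, and the implementation names are hypothetical, the sketch yields the op name rather than an `IrOp` object, and it omits the platform kernel import step.

```python
from dataclasses import dataclass, field, fields
from typing import Iterator

# Hypothetical registry: op name -> implementation names known for that op.
KNOWN_IMPLS = {
    "fused_add_rms_norm": {"native"},
    "rms_norm": {"native", "vendor_kernel"},
}


@dataclass
class PriorityIterSketch:
    fused_add_rms_norm: list[str] = field(default_factory=list)
    rms_norm: list[str] = field(default_factory=list)

    def iter_op_priorities(self) -> Iterator[tuple[str, list[str]]]:
        """Yield (op_name, priority_list) per field, validating each entry."""
        for f in fields(self):
            priority = getattr(self, f.name)
            unknown = [p for p in priority if p not in KNOWN_IMPLS[f.name]]
            if unknown:
                raise ValueError(f"Unknown implementations for {f.name}: {unknown}")
            yield f.name, priority
```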
compute_hash ¶
compute_hash() -> str
Produces a hash unique to this IR op priority configuration. Any new fields that affect compilation should be added to the hash. Any future fields that don't affect compilation should be excluded.
The UUIDs of the IR op implementations are also added to the hash manually so that they affect the compile cache.
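One way to picture a hash that covers both the priority lists and the implementation UUIDs is sketched below. The function `compute_hash_sketch`, the JSON serialization, and the use of SHA-256 are assumptions for illustration; the actual hashing scheme in vLLM may differ.

```python
import hashlib
import json


def compute_hash_sketch(
    priorities: dict[str, list[str]], impl_uuids: dict[str, str]
) -> str:
    """Hash the priority lists together with implementation UUIDs.

    Keys are sorted so that dictionary insertion order does not change the result.
    """
    payload = {"priorities": priorities, "impl_uuids": impl_uuids}
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()
```

Because the UUIDs are part of the payload, changing an implementation changes the hash even when the priority lists stay the same, which is the effect the description above asks for.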
set_default ¶
set_priority ¶
Context manager to set the IR op priority for all op members. It also imports IR kernel implementations for the current platform to ensure all implementations are made available.
with_default classmethod ¶
with_default(
default: list[str], /, **kwargs: list[str]
) -> IrOpPriorityConfig
A helper to create an IrOpPriorityConfig where fields not specified in kwargs use the given default list.
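The fill-in behavior can be sketched with a stand-in dataclass that mirrors the signature above. `PriorityDefaultSketch` and the implementation names are hypothetical; only the two field names and the `with_default` signature come from this page.

```python
from dataclasses import dataclass, field, fields


@dataclass
class PriorityDefaultSketch:
    fused_add_rms_norm: list[str] = field(default_factory=list)
    rms_norm: list[str] = field(default_factory=list)

    @classmethod
    def with_default(
        cls, default: list[str], /, **kwargs: list[str]
    ) -> "PriorityDefaultSketch":
        # Start every field from a copy of the default, then apply overrides.
        values = {f.name: list(default) for f in fields(cls)}
        values.update(kwargs)
        # An unknown keyword raises TypeError in the dataclass constructor.
        return cls(**values)


cfg = PriorityDefaultSketch.with_default(
    ["native"], rms_norm=["vendor_kernel", "native"]
)
print(cfg)
```

Each field receives its own copy of the default list, so mutating one field later does not affect the others.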
KernelConfig ¶
Configuration for kernel selection and warmup behavior.
enable_flashinfer_autotune class-attribute instance-attribute ¶
enable_flashinfer_autotune: bool = None
If True, run FlashInfer autotuning during kernel warmup.
ir_op_priority class-attribute instance-attribute ¶
ir_op_priority: IrOpPriorityConfig = Field(
default_factory=IrOpPriorityConfig
)
vLLM IR op priority for dispatching/lowering during the forward pass. Platform defaults are appended automatically during VllmConfig.__post_init__.
linear_backend class-attribute instance-attribute ¶
Backend for quantized linear layer GEMM kernels. Available options:
- "auto": Automatically select the best backend based on model and hardware
- "cutlass": Use CUTLASS-based kernels
- "flashinfer_cutlass": Use FlashInfer with CUTLASS kernels
- "flashinfer_trtllm": Use FlashInfer with TensorRT-LLM kernels
- "flashinfer_cudnn": Use FlashInfer with cuDNN kernels
- "marlin": Use Marlin kernels
- "triton": Use Triton-based kernels
- "deep_gemm": Use DeepGEMM kernels
- "torch": Use PyTorch native scaled_mm kernels
- "aiter": Use AMD AITer kernels (ROCm only)
- "machete": Use Machete kernels (mixed-precision)
- "fbgemm": Use FBGEMM kernels
- "conch": Use Conch mixed-precision kernels
- "exllama": Use Exllama mixed-precision kernels
- "emulation": Use slow dequant-to-BF16 emulation (for testing only)
moe_backend class-attribute instance-attribute ¶
Backend for MoE expert computation kernels. Available options:
- "auto": Automatically select the best backend based on model and hardware
- "triton": Use Triton-based fused MoE kernels
- "deep_gemm": Use DeepGEMM kernels (FP8 block-quantized only)
- "deep_gemm_mega_moe": Use DeepGEMM mega MoE kernels
- "cutlass": Use vLLM CUTLASS kernels
- "flashinfer_trtllm": Use FlashInfer with TRTLLM-GEN kernels
- "flashinfer_cutlass": Use FlashInfer with CUTLASS kernels
- "flashinfer_cutedsl": Use FlashInfer with CuteDSL kernels (FP4 only)
- "marlin": Use Marlin kernels (weight-only quantization)
- "humming": Use Humming Mixed Precision kernels
- "triton_unfused": Use Triton unfused MoE kernels
- "aiter": Use AMD AITer kernels (ROCm only)
- "emulation": Use BF16/FP16 GEMM, dequantizing weights and running QDQ on activations
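A configuration loader might reject unknown backend names before they reach kernel selection. The sketch below checks a string against the `moe_backend` options listed above; the names `MOE_BACKENDS` and `check_moe_backend` are hypothetical, and the sketch does not model hardware or quantization constraints such as the ROCm-only or FP4-only notes.

```python
# Option names copied from the moe_backend list above.
MOE_BACKENDS = frozenset({
    "auto", "triton", "deep_gemm", "deep_gemm_mega_moe", "cutlass",
    "flashinfer_trtllm", "flashinfer_cutlass", "flashinfer_cutedsl",
    "marlin", "humming", "triton_unfused", "aiter", "emulation",
})


def check_moe_backend(name: str) -> str:
    """Return the name unchanged if it is a documented option, else raise."""
    if name not in MOE_BACKENDS:
        options = ", ".join(sorted(MOE_BACKENDS))
        raise ValueError(f"Unknown moe_backend {name!r}; expected one of: {options}")
    return name
```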
_skip_none_validation classmethod ¶
Skip validation if the value is None when initialization is delayed.
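The idea of letting `None` bypass validation can be sketched in plain Python as a wrapper around a validation handler. The names `skip_none_validation` and `require_bool` are hypothetical and stand in for whatever validation machinery the config class actually uses.

```python
from typing import Callable, Optional


def require_bool(value: object) -> bool:
    """Stand-in validator: accept only real booleans."""
    if not isinstance(value, bool):
        raise TypeError(f"expected bool, got {type(value).__name__}")
    return value


def skip_none_validation(
    value: object, handler: Callable[[object], bool]
) -> Optional[bool]:
    """Pass None through untouched; validate every other value with handler."""
    if value is None:
        # Initialization is delayed; the real value is filled in later.
        return None
    return handler(value)
```

This is why a field annotated as `bool` can still carry `None` as its initial value until the platform defaults are applied.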
compute_hash ¶
compute_hash() -> str
Produces a hash unique to this kernel configuration. Any new fields that affect compilation should be added to the hash. Any future fields that don't affect compilation should be excluded.
set_platform_defaults ¶
set_platform_defaults(vllm_config: VllmConfig) -> None
Set platform-specific defaults for the kernel config.