vllm.config.attention ¶
AttentionConfig ¶
Configuration for attention mechanisms in vLLM.
Source code in vllm/config/attention.py
backend class-attribute instance-attribute ¶
backend: AttentionBackendEnum | None = None
Attention backend to use. Use "auto" or None for automatic selection.
disable_flashinfer_q_quantization class-attribute instance-attribute ¶
disable_flashinfer_q_quantization: bool = False
If set, do not quantize Q to fp8 when using an fp8 KV cache.
flash_attn_max_num_splits_for_cuda_graph class-attribute instance-attribute ¶
flash_attn_max_num_splits_for_cuda_graph: int = 32
Maximum number of splits Flash Attention may use for CUDA graph decode.
flash_attn_version class-attribute instance-attribute ¶
flash_attn_version: Literal[2, 3, 4] | None = None
Force vLLM to use a specific flash-attention version (2, 3, or 4). Only valid when using the flash-attention backend.
flex_attn_block_m class-attribute instance-attribute ¶
flex_attn_block_m: int | None = None
Triton kernel BLOCK_M tile size for flex attention. Must be a power of 2 >= 16. If None and VLLM_BATCH_INVARIANT=1, defaults to 16.
flex_attn_block_n class-attribute instance-attribute ¶
flex_attn_block_n: int | None = None
Triton kernel BLOCK_N tile size for flex attention. Must be a power of 2 >= 16. If None and VLLM_BATCH_INVARIANT=1, defaults to 16.
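The tile-size constraint stated for flex_attn_block_m and flex_attn_block_n (a power of 2 that is at least 16, with a fallback of 16 under VLLM_BATCH_INVARIANT=1) can be sketched as below. This is only an illustration of the documented rule, not vLLM's own validation code: the function names `is_valid_flex_tile_size` and `resolve_tile_size` are hypothetical, and the batch-invariant setting is passed in as a plain boolean rather than read from the environment.

```python
from __future__ import annotations


def is_valid_flex_tile_size(value: int) -> bool:
    """Return True if value is a power of 2 that is at least 16."""
    # A positive integer is a power of 2 exactly when it has a single set bit.
    return value >= 16 and (value & (value - 1)) == 0


def resolve_tile_size(value: int | None, batch_invariant: bool) -> int | None:
    """Hypothetical helper: apply the documented default and constraint.

    Returns None when no value is set and batch invariance is off,
    meaning the choice is left to whatever the kernel does by default.
    """
    if value is None:
        return 16 if batch_invariant else None
    if not is_valid_flex_tile_size(value):
        raise ValueError(f"tile size must be a power of 2 >= 16, got {value}")
    return value
```

Under these assumptions, `resolve_tile_size(None, batch_invariant=True)` yields 16, while a value such as 24 is rejected because it is not a power of 2.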
flex_attn_kv_block_size class-attribute instance-attribute ¶
flex_attn_kv_block_size: int | None = None
Logical KV block size for the flex attention block mask. Must be a power of 2 and divisible by flex_attn_block_n. If None, uses the default (kv_cache_block_size on PyTorch >= 2.9, 128 otherwise).
flex_attn_q_block_size class-attribute instance-attribute ¶
flex_attn_q_block_size: int | None = None
Logical Q block size for the flex attention block mask. Must be a power of 2 and divisible by flex_attn_block_m. If None, uses the default (16 on PyTorch >= 2.9, 128 otherwise).
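The rules for the logical block sizes (power of 2, divisible by the matching tile size, with PyTorch-version-dependent defaults) might be expressed as in the following sketch. All function names here are hypothetical, and the PyTorch version is passed in as a `(major, minor)` tuple as a simplifying assumption so that the example runs without importing torch.

```python
from __future__ import annotations


def _is_power_of_two(n: int) -> bool:
    return n > 0 and (n & (n - 1)) == 0


def check_logical_block_size(block_size: int, tile_size: int, name: str) -> None:
    """Raise ValueError if block_size breaks the documented constraints."""
    if not _is_power_of_two(block_size):
        raise ValueError(f"{name} must be a power of 2, got {block_size}")
    if block_size % tile_size != 0:
        raise ValueError(
            f"{name} ({block_size}) must be divisible by the tile size ({tile_size})"
        )


def default_q_block_size(torch_version: tuple[int, int]) -> int:
    """Documented default for flex_attn_q_block_size."""
    return 16 if torch_version >= (2, 9) else 128


def default_kv_block_size(
    torch_version: tuple[int, int], kv_cache_block_size: int
) -> int:
    """Documented default for flex_attn_kv_block_size."""
    return kv_cache_block_size if torch_version >= (2, 9) else 128
```

For instance, a logical KV block size of 64 with a BLOCK_N tile of 16 passes the check, whereas 16 with a tile of 32 fails the divisibility rule.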
mla_prefill_backend class-attribute instance-attribute ¶
mla_prefill_backend: MLAPrefillBackendEnum | None = None
MLA prefill backend to use. If None, will be selected automatically. Valid options: FLASH_ATTN (FA3/FA4), FLASHINFER, TRTLLM_RAGGED.
tq_max_kv_splits_for_cuda_graph class-attribute instance-attribute ¶
tq_max_kv_splits_for_cuda_graph: int = 32
TurboQuant max NUM_KV_SPLITS for cuda graph decode. Fixes the split count so grid dimensions are constant across captures, and buffers can be pre-allocated to avoid inflating the memory estimate.
use_fp4_indexer_cache class-attribute instance-attribute ¶
use_fp4_indexer_cache: bool = False
If set, use the fp4 indexer cache for dsv32-family models (not supported yet).
use_non_causal class-attribute instance-attribute ¶
use_non_causal: bool = False
Whether to use non-causal (bidirectional) attention.
use_prefill_decode_attention class-attribute instance-attribute ¶
use_prefill_decode_attention: bool = False
Use separate prefill and decode kernels for attention instead of the unified Triton kernel.
use_prefill_query_quantization class-attribute instance-attribute ¶
use_prefill_query_quantization: bool = False
If set, quantize query for attention in prefill.
use_trtllm_attention class-attribute instance-attribute ¶
use_trtllm_attention: bool | None = None
If set to True/False, use or don't use the TRTLLM attention backend in flashinfer. If None, auto-detect the attention backend in flashinfer.
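The tri-state semantics of use_trtllm_attention (True, False, or None for auto-detection) follow a common pattern, sketched below. The helper `resolve_use_trtllm` and its `auto_detect` callback are hypothetical stand-ins; how vLLM actually performs the detection is not described here.

```python
from __future__ import annotations

from typing import Callable


def resolve_use_trtllm(flag: bool | None, auto_detect: Callable[[], bool]) -> bool:
    """Honor an explicit True/False; fall back to detection only for None."""
    if flag is None:
        # No explicit choice was made, so defer to the (assumed) detection logic.
        return auto_detect()
    return flag
```

Comparing against `None` explicitly, rather than testing truthiness, is what keeps an explicit `False` from being mistaken for "unset".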
compute_hash ¶
compute_hash() -> str
Provide a hash that uniquely identifies all the configs that affect the structure of the computation graph from input ids/embeddings to the final hidden states, excluding anything before input ids/embeddings and after the final hidden states.
Source code in vllm/config/attention.py
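To illustrate the idea behind compute_hash, the sketch below derives a stable digest from a config's field values, so that two configs with identical settings map to the same string. It is an assumption-laden toy: `ToyAttentionConfig` is a hypothetical stand-in with only three fields, it hashes every field, and the use of SHA-256 over a `repr` of the values is chosen here for simplicity. Which fields vLLM actually includes, and how it encodes them, is not stated in this page.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass, fields


@dataclass
class ToyAttentionConfig:
    """Hypothetical stand-in for a config object; not the real class."""

    backend: str | None = None
    flash_attn_version: int | None = None
    use_non_causal: bool = False

    def compute_hash(self) -> str:
        # Collect (name, value) pairs in declaration order so the result
        # is deterministic for a given set of settings.
        factors = [(f.name, getattr(self, f.name)) for f in fields(self)]
        return hashlib.sha256(repr(factors).encode()).hexdigest()
```

Two instances created with the same arguments produce the same digest, while changing any hashed field (for example `use_non_causal=True`) produces a different one.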
validate_backend_before classmethod ¶
Enable parsing of the backend enum type from string.
The special value "auto" is treated as None, which triggers automatic backend selection.
Source code in vllm/config/attention.py
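The "before" validation described for backend, where a string is converted to the enum and "auto" becomes None, could look roughly like this. The enum `ToyBackend`, its two members, and the function `parse_backend` are hypothetical; in particular, the case-insensitive matching is an assumption made for this sketch and is not confirmed by the text above.

```python
from __future__ import annotations

import enum


class ToyBackend(enum.Enum):
    """Hypothetical stand-in for the real backend enum."""

    FLASH_ATTN = "FLASH_ATTN"
    FLASHINFER = "FLASHINFER"


def parse_backend(value: object) -> ToyBackend | None:
    """Normalize user input to an enum member, or None for auto-selection."""
    if value is None or isinstance(value, ToyBackend):
        return value
    if isinstance(value, str):
        if value.lower() == "auto":
            # "auto" is treated as None, which triggers automatic selection.
            return None
        # Look up by member name; raises KeyError for unknown names.
        return ToyBackend[value.upper()]
    raise TypeError(f"unsupported backend value: {value!r}")
```

Running the conversion before type validation lets callers pass plain strings (for example from a command line) while the rest of the code only ever sees an enum member or None.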
validate_mla_prefill_backend_before classmethod ¶
Enable parsing of the mla_prefill_backend enum type from string.