vllm.models.deepseek_v4.common.ops.fused_inv_rope_fp8_quant
Fused inverse RoPE + block-scaled FP8 quantization kernel for DeepseekV4 attention.
The scales are emitted directly in the layout fp8_einsum requires (MN-major, TMA-aligned; FP32 on SM90, INT32-packed UE8M0 on SM100), so fp8_einsum can skip transform_sf_into_required_layout.
fused_inv_rope_fp8_quant
fused_inv_rope_fp8_quant(
    o: Tensor,
    positions: Tensor,
    cos_sin_cache: Tensor,
    n_groups: int,
    heads_per_group: int,
    nope_dim: int = 448,
    rope_dim: int = 64,
    quant_group_size: int = 128,
    tma_aligned_scales: bool = False,
) -> tuple[Tensor, Tensor]
Fused inverse RoPE + block-scaled FP8 quantization.
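To make "inverse RoPE" concrete, here is a minimal pure-Python sketch of a rotation being undone with a cache row laid out as cosines followed by sines. It is not the kernel's implementation: the helper names, the frequency base of 10000, and the way elements are paired are all assumptions made for illustration.

```python
import math


def cos_sin_row(position, rope_dim=64, base=10000.0):
    """One cache row: rope_dim/2 cosines followed by rope_dim/2 sines.

    The frequency base is an assumption; the real cache may use another.
    """
    half = rope_dim // 2
    angles = [position * base ** (-2.0 * i / rope_dim) for i in range(half)]
    return [math.cos(a) for a in angles] + [math.sin(a) for a in angles]


def rope(pairs, row):
    """Forward rotation of each (x, y) pair by its angle."""
    half = len(row) // 2
    return [(x * row[i] - y * row[half + i], x * row[half + i] + y * row[i])
            for i, (x, y) in enumerate(pairs)]


def inv_rope(pairs, row):
    """Inverse rotation: same cos, negated sin."""
    half = len(row) // 2
    return [(x * row[i] + y * row[half + i], -x * row[half + i] + y * row[i])
            for i, (x, y) in enumerate(pairs)]


row = cos_sin_row(position=5)
original = [(float(i), 0.5 - float(i)) for i in range(32)]
restored = inv_rope(rope(original, row), row)  # matches original up to rounding
```

How the real kernel pairs elements inside the rope_dim slice (interleaved or split halves) is not stated here, so the sketch works on explicit pairs.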
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `o` | `Tensor` | Attention output, `[num_tokens, num_heads, head_dim]`, bf16. | required |
| `positions` | `Tensor` | Token positions, `[num_tokens]`, int64. | required |
| `cos_sin_cache` | `Tensor` | Precomputed cache, `[max_pos, rope_dim]`, with cos and sin concatenated along the last dimension. | required |
| `n_groups` | `int` | Number of output groups. | required |
| `heads_per_group` | `int` | Heads per group. | required |
| `nope_dim` | `int` | Non-RoPE dimensions per head. | 448 |
| `rope_dim` | `int` | RoPE dimensions per head. | 64 |
| `quant_group_size` | `int` | FP8 quantization block size. | 128 |
| `tma_aligned_scales` | `bool` | If True, output INT32-packed UE8M0 scales (SM100); if False, output FP32 scales (SM90). | False |
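The two scale formats selected by tma_aligned_scales can be illustrated with a small pure-Python sketch: one scale per quant_group_size block, either kept as a float or reduced to an 8-bit power-of-two exponent and packed four to a 32-bit word. The helper names, the zero-block floor, rounding the scale up to a power of two, and the byte order inside the packed word are assumptions, not the kernel's confirmed behavior.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn


def block_scales(values, group_size=128):
    """Return one float scale (amax / 448) per group_size chunk."""
    scales = []
    for start in range(0, len(values), group_size):
        block = values[start:start + group_size]
        amax = max(abs(v) for v in block)
        # Assumed floor so an all-zero block does not produce a zero scale.
        scales.append(max(amax, 1e-10) / FP8_E4M3_MAX)
    return scales


def to_ue8m0(scale):
    """Round a positive scale up to a power of two; return the biased exponent byte."""
    # frexp gives scale == mantissa * 2**exponent with 0.5 <= mantissa < 1.
    mantissa, exponent = math.frexp(scale)
    power = exponent - 1 if mantissa == 0.5 else exponent
    # Clamped to the 8-bit range for illustration.
    return max(0, min(255, power + 127))


def pack_ue8m0(exponents):
    """Pack four exponent bytes into one 32-bit word (byte order is an assumption)."""
    assert len(exponents) == 4
    word = 0
    for i, e in enumerate(exponents):
        word |= (e & 0xFF) << (8 * i)
    return word


scales = block_scales([448.0] * 128 + [1.0] * 128)      # two blocks of 128
exponents = [to_ue8m0(s) for s in scales]               # one byte per block
packed = pack_ue8m0(exponents + [127, 127])             # padded with 2**0 scales
```

A power-of-two scale carries no mantissa, which is why a single exponent byte per block is enough in the UE8M0 case.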
Returns:
| Name | Type | Description |
|---|---|---|
| `o_fp8` | `Tensor` | `[T, G, D]`, float8_e4m3fn, with strides `(D, T*D, 1)`, i.e. stored group-major in memory. |
| `o_scale` | `Tensor` | Scale tensor, already in the layout fp8_einsum requires. |
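The documented strides of o_fp8 imply that the logical `[T, G, D]` tensor is a view over a buffer laid out contiguously as `[G, T, D]`. The sketch below checks this with plain offset arithmetic; `flat_offset` is a hypothetical helper, and the small sizes are arbitrary. T, G, and D presumably stand for num_tokens, n_groups, and the per-group width, which is an assumption here.

```python
def flat_offset(index, strides):
    """Element offset in memory for a multi-dimensional index."""
    return sum(i * s for i, s in zip(index, strides))


T, G, D = 3, 2, 4                  # arbitrary small sizes for illustration
view_strides = (D, T * D, 1)       # strides documented for o_fp8, shape [T, G, D]
contiguous_gtd = (T * D, D, 1)     # strides of a contiguous [G, T, D] buffer

# Every logical element (t, g, d) lands where (g, t, d) sits in the [G, T, D] buffer.
same_layout = all(
    flat_offset((t, g, d), view_strides) == flat_offset((g, t, d), contiguous_gtd)
    for t in range(T) for g in range(G) for d in range(D)
)

# The view touches each of the T*G*D memory slots exactly once.
offsets = sorted(
    flat_offset((t, g, d), view_strides)
    for t in range(T) for g in range(G) for d in range(D)
)
```

In PyTorch terms, the same strides arise from a contiguous `[G, T, D]` tensor viewed with its first two dimensions swapped, so all tokens of one group are adjacent in memory.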