vllm.v1.kv_offload.tiering.spec
TieringOffloadingSpec: Spec for multi-tier KV cache offloading.
This spec creates a TieringOffloadingManager with a CPU primary tier and configurable secondary tiers (e.g., Storage, Network).
Configuration via kv_connector_extra_config
- cpu_bytes_to_use: (required) Bytes to allocate for CPU primary tier
- block_size: (optional) Block size for offloaded blocks (default: GPU block size)
- eviction_policy: (optional) Primary tier eviction policy: "lru" or "arc" (default: "lru")
- secondary_tiers: (optional) List of secondary tier configurations. Each secondary tier config is a dict with:
  - type: (required) Type of secondary tier (e.g., "example", "storage", "network")
  - Additional tier-specific parameters, which are passed directly to the tier constructor. See each tier's documentation for supported parameters.
Example configuration:

    {
        "cpu_bytes_to_use": 10737418240,  # 10 GiB
        "block_size": 16,
        "eviction_policy": "lru",
        "secondary_tiers": [
            {
                "type": "example",
                "custom_param": 67
            }
        ]
    }
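The options listed above can be assembled and sanity-checked before they are handed to vLLM. The following is a minimal sketch, not part of vLLM: `build_tiering_config` and `GIB` are hypothetical names introduced for illustration, and the checks only mirror the rules stated in the list (required `cpu_bytes_to_use`, `eviction_policy` of "lru" or "arc", required `type` per secondary tier).

```python
from typing import Any, Optional

GIB = 1024 ** 3  # bytes per GiB


def build_tiering_config(
    cpu_gib: float,
    block_size: Optional[int] = None,
    eviction_policy: str = "lru",
    secondary_tiers: Optional[list[dict[str, Any]]] = None,
) -> dict[str, Any]:
    """Hypothetical helper: build a dict shaped like kv_connector_extra_config."""
    if cpu_gib <= 0:
        raise ValueError("cpu_bytes_to_use is required and must be positive")
    if eviction_policy not in ("lru", "arc"):
        raise ValueError(f"unsupported eviction_policy: {eviction_policy!r}")

    config: dict[str, Any] = {
        "cpu_bytes_to_use": int(cpu_gib * GIB),
        "eviction_policy": eviction_policy,
    }
    # block_size is optional; when omitted, the documented default
    # (the GPU block size) applies, so the key is simply left out.
    if block_size is not None:
        config["block_size"] = block_size

    tiers = []
    for tier in secondary_tiers or []:
        if "type" not in tier:
            raise ValueError("each secondary tier config needs a 'type' key")
        tiers.append(dict(tier))  # copy so the caller's dict is not shared
    if tiers:
        config["secondary_tiers"] = tiers
    return config


extra_config = build_tiering_config(
    10,
    block_size=16,
    secondary_tiers=[{"type": "example", "custom_param": 67}],
)
```

The resulting `extra_config` has the same content as the example configuration. How it is attached to the connector (for instance through a KV transfer configuration object) depends on the vLLM version and is not shown here.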
TieringOffloadingSpec
Bases: CPUOffloadingSpec
Spec for multi-tier KV cache offloading.
Creates a TieringOffloadingManager with:
- Primary tier: CPU (LRU or ARC eviction policy)
- Secondary tiers: Configurable via extra_config
The CPU primary tier has direct GPU access and serves as the gateway for all GPU↔offload operations. Secondary tiers cannot directly access GPU memory and must transfer data through the primary tier.
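The gateway rule can be illustrated with a toy model. This is a sketch under stated assumptions, not the TieringOffloadingManager implementation: `ToyTieredStore`, its method names, and the choice to move (rather than copy) a block on demotion are all hypothetical. The only property it is meant to show is that data never travels directly between the GPU and a secondary tier.

```python
class ToyTieredStore:
    """Hypothetical model: GPU <-> CPU (primary) <-> storage (secondary)."""

    def __init__(self) -> None:
        self.gpu: dict[str, bytes] = {}
        self.cpu: dict[str, bytes] = {}
        self.storage: dict[str, bytes] = {}
        self.hops: list[tuple[str, str]] = []  # record of every transfer

    def offload(self, block_hash: str) -> None:
        # GPU -> CPU: the primary tier is the only tier that touches GPU data.
        self.cpu[block_hash] = self.gpu[block_hash]
        self.hops.append(("gpu", "cpu"))

    def demote(self, block_hash: str) -> None:
        # CPU -> storage: assumed here to move the block out of the CPU tier.
        self.storage[block_hash] = self.cpu.pop(block_hash)
        self.hops.append(("cpu", "storage"))

    def load(self, block_hash: str) -> bytes:
        # A block held only in the secondary tier is first staged in CPU.
        if block_hash not in self.cpu:
            self.cpu[block_hash] = self.storage[block_hash]
            self.hops.append(("storage", "cpu"))
        self.gpu[block_hash] = self.cpu[block_hash]
        self.hops.append(("cpu", "gpu"))
        return self.gpu[block_hash]


store = ToyTieredStore()
store.gpu["block-a"] = b"kv"
store.offload("block-a")
store.demote("block-a")
del store.gpu["block-a"]  # the GPU copy is freed after offloading
data = store.load("block-a")
```

After this sequence, `store.hops` records four transfers, each with the CPU tier on one side; no entry pairs "gpu" with "storage".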
Source code in vllm/v1/kv_offload/tiering/spec.py
get_manager
get_manager() -> OffloadingManager
Get the TieringOffloadingManager.
Creates a TieringOffloadingManager with:
- Primary tier: CPU (LRU or ARC)
- Secondary tiers: As configured in extra_config
Returns:
| Type | Description |
|---|---|
| OffloadingManager | TieringOffloadingManager instance |
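The configuration notes state that each secondary tier config carries a required type and that the remaining keys are passed directly to the tier constructor. One plausible way to implement that dispatch is sketched below. `TIER_REGISTRY`, `ExampleTier`, and `build_secondary_tiers` are hypothetical names for illustration; the actual lookup used by get_manager may differ.

```python
from typing import Any, Callable


class ExampleTier:
    """Hypothetical secondary tier that accepts one tier-specific parameter."""

    def __init__(self, custom_param: int = 0) -> None:
        self.custom_param = custom_param


# Assumed mapping from the "type" string to a tier constructor.
TIER_REGISTRY: dict[str, Callable[..., Any]] = {"example": ExampleTier}


def build_secondary_tiers(extra_config: dict[str, Any]) -> list[Any]:
    """Instantiate one tier per entry in extra_config["secondary_tiers"]."""
    tiers = []
    for tier_config in extra_config.get("secondary_tiers", []):
        params = dict(tier_config)  # copy so the input is left unmodified
        tier_type = params.pop("type", None)
        if tier_type is None:
            raise ValueError("secondary tier config is missing 'type'")
        if tier_type not in TIER_REGISTRY:
            raise ValueError(f"unknown secondary tier type: {tier_type!r}")
        # Everything except "type" is forwarded to the constructor as-is.
        tiers.append(TIER_REGISTRY[tier_type](**params))
    return tiers


tiers = build_secondary_tiers(
    {
        "cpu_bytes_to_use": 10737418240,
        "secondary_tiers": [{"type": "example", "custom_param": 67}],
    }
)
```

Because secondary_tiers is optional, a config without that key yields an empty list in this sketch, which corresponds to a manager with only the CPU primary tier.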
Source code in vllm/v1/kv_offload/tiering/spec.py