MooncakeStoreConnector Usage Guide¶
MooncakeStoreConnector is a KV cache connector that uses MooncakeDistributedStore as a shared KV cache pool. Unlike MooncakeConnector, which performs direct point-to-point KV transfer between prefiller and decoder, MooncakeStoreConnector offloads KV cache to an external distributed store. This supports:
- CPU/disk offloading: Extend effective KV cache capacity by offloading to CPU memory or disk via Mooncake's transfer engine.
- Prefix caching across instances: Hash-based deduplication allows multiple vLLM instances to share cached KV blocks through the store.
- Single-node and multi-node deployment: Works both as a standalone KV cache extension and in disaggregated prefill-decode setups.
Prerequisites¶
Install Mooncake¶
Install mooncake through pip:
Refer to the Mooncake official repository for more installation instructions and building from source.
Start the Mooncake Master Server¶
The Mooncake master manages metadata and coordinates the distributed store. Start it before launching vLLM:
Default ports:
- RPC: 50051
Multiple vLLM instances can share the same master server.
Configure Mooncake¶
Create a JSON configuration file (e.g., mooncake_config.json):
{
"mode": "embedded",
"metadata_server": "P2PHANDSHAKE",
"master_server_address": "127.0.0.1:50051",
"global_segment_size": "80GB",
"local_buffer_size": "4GB",
"protocol": "rdma",
"device_name": "",
"enable_offload": false
}
- mode: Topology selection. "embedded" (default, PR-40900 baseline) has each vLLM rank contribute global_segment_size to the pool in-process. "standalone-store" makes ranks pure requesters: an external mooncake_client process owns the CPU pool and (optionally) the SSD tier.
- protocol: Use "rdma" for best performance. "tcp" works as a fallback.
- global_segment_size: CPU memory contributed to the distributed pool (per GPU). Must be > 0 in embedded mode and 0 in standalone-store mode.
- local_buffer_size: Private buffer for this node's own operations (per GPU).
- enable_offload: When true, vLLM allocates a DirectIO staging buffer so large prefills do not exceed the owner's SSD-write budget. Set this together with the matching --enable_offload=true flag on mooncake_master and on the external mooncake_client (if any).
Set the config path via environment variable:
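For example, assuming the config file sits in the current working directory (an illustrative location, not a requirement), the variable could be exported like this:

```shell
# Illustrative path: assumes mooncake_config.json is in the current directory.
export MOONCAKE_CONFIG_PATH="$PWD/mooncake_config.json"

# Print the value so a wrong path is visible before vLLM starts.
echo "MOONCAKE_CONFIG_PATH=$MOONCAKE_CONFIG_PATH"
```

Exporting the variable once in the shell is equivalent to prefixing each vllm serve command with MOONCAKE_CONFIG_PATH=..., as the usage examples in this guide do.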
Usage¶
Single-Node KV Cache Offloading¶
Use MooncakeStoreConnector to offload KV cache to CPU memory, extending the effective cache size:
MOONCAKE_CONFIG_PATH=mooncake_config.json \
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--kv-transfer-config '{"kv_connector":"MooncakeStoreConnector","kv_role":"kv_both"}'
Disaggregated Prefill-Decode (XpYd)¶
In disaggregated prefill-decode mode, use MultiConnector to combine MooncakeConnector (point-to-point KV transfer) with MooncakeStoreConnector (shared KV cache pool). This enables both direct P2P transfer between prefiller and decoder, and cross-instance prefix cache sharing via the distributed store.
Prefiller Node:
MOONCAKE_CONFIG_PATH=mooncake_config.json \
VLLM_MOONCAKE_BOOTSTRAP_PORT=50052 \
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--port 8100 \
--kv-transfer-config '{
"kv_connector": "MultiConnector",
"kv_role": "kv_producer",
"kv_connector_extra_config": {
"connectors": [
{
"kv_connector": "MooncakeConnector",
"kv_role": "kv_producer"
},
{
"kv_connector": "MooncakeStoreConnector",
"kv_role": "kv_both"
}
]
}
}'
Decoder Node:
MOONCAKE_CONFIG_PATH=mooncake_config.json \
VLLM_MOONCAKE_BOOTSTRAP_PORT=50053 \
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--port 8200 \
--kv-transfer-config '{
"kv_connector": "MultiConnector",
"kv_role": "kv_consumer",
"kv_connector_extra_config": {
"connectors": [
{
"kv_connector": "MooncakeConnector",
"kv_role": "kv_consumer"
},
{
"kv_connector": "MooncakeStoreConnector",
"kv_role": "kv_consumer"
}
]
}
}'
Proxy:
A disaggregation proxy is required to route requests between prefiller and decoder nodes. The proxy assigns do_remote_prefill=True / do_remote_decode=True to coordinate P2P transfer via MooncakeConnector. Refer to the MooncakeConnector usage guide for proxy setup details.
Disk Offloading¶
Disk offloading is most commonly run in standalone-store mode: an external mooncake_client process owns the CPU pool and the SSD tier, and each vLLM rank is a pure requester. This avoids per-rank duplication of the SSD pool and keeps DirectIO budget tracking on a single process.
Three things need to be aligned for end-to-end disk offloading:
- mooncake_master is started with --enable_offload=true.
- mooncake_client (the owner) is started with --enable_offload=true plus an SSD path via MOONCAKE_OFFLOAD_FILE_STORAGE_PATH.
- The vLLM side sets "enable_offload": true in the JSON config file (this is read by the connector and is not an environment variable).
Example mooncake_config.json for the vLLM side:
{
"mode": "standalone-store",
"metadata_server": "P2PHANDSHAKE",
"master_server_address": "127.0.0.1:50051",
"global_segment_size": 0,
"local_buffer_size": "4GB",
"protocol": "rdma",
"device_name": "mlx5_0",
"enable_offload": true
}
Steer this rank to the local owner segment with:
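A minimal sketch, using the MOONCAKE_PREFERRED_SEGMENT variable described in the Environment Variables table. The host:port value below is a hypothetical placeholder; substitute the address of the mooncake_client owner segment running on the same host as this rank:

```shell
# Hypothetical owner address: replace with the host:port of the local
# mooncake_client segment that should hold this rank's replicas.
export MOONCAKE_PREFERRED_SEGMENT="127.0.0.1:12345"

echo "Preferred segment: $MOONCAKE_PREFERRED_SEGMENT"
```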
The owner's SSD directory, on-disk eviction policy, and the DirectIO staging buffer size are controlled on the mooncake_client side via the standard Mooncake environment variables (MOONCAKE_OFFLOAD_FILE_STORAGE_PATH, MOONCAKE_BUCKET_EVICTION_POLICY, MOONCAKE_USE_URING, MOONCAKE_OFFLOAD_LOCAL_BUFFER_SIZE_BYTES, MOONCAKE_OFFLOAD_TOTAL_SIZE_LIMIT_BYTES, etc.). Those are independent of the vLLM JSON config.
Environment Variables¶
| Variable | Description | Default |
|---|---|---|
| MOONCAKE_CONFIG_PATH | Path to Mooncake JSON config file | (required) |
| VLLM_MOONCAKE_BOOTSTRAP_PORT | Bootstrap port for MooncakeConnector P2P transfer (disagg mode only) | 8998 |
| MOONCAKE_PREFERRED_SEGMENT | Pin this rank's replicas to a specific owner segment (host:port); used in standalone-store mode | (unset) |
| MOONCAKE_REQUESTER_LOCAL_HOSTNAME | Override the hostname the vLLM rank registers with Mooncake as a requester. Defaults to the rank's resolved IP. | (unset) |
| VLLM_MOONCAKE_STORE_TIER_LOG | When 1, logs a per-batch tier summary (memory vs disk hits) for observability | disabled |
| VLLM_MOONCAKE_DISK_STAGING_USABLE_RATIO | Fraction of the owner's DirectIO staging buffer that the requester will fill in a single batch_get_into_multi_buffers call. Lower → more conservative pre-split, more round trips. | 0.9 |
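As a sketch of how the two tuning variables above might be combined while debugging disk offloading: the 0.8 ratio is an illustrative value chosen only because it is below the 0.9 default, not a recommendation.

```shell
# Log a per-batch memory-vs-disk tier summary.
export VLLM_MOONCAKE_STORE_TIER_LOG=1

# Illustrative value below the 0.9 default: a more conservative pre-split,
# at the cost of more round trips per load.
export VLLM_MOONCAKE_DISK_STAGING_USABLE_RATIO=0.8

echo "tier log=$VLLM_MOONCAKE_STORE_TIER_LOG ratio=$VLLM_MOONCAKE_DISK_STAGING_USABLE_RATIO"
```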
KV Transfer Config¶
KV Role Options¶
- kv_producer: For instances that store KV caches to the pool.
- kv_consumer: For instances that load KV caches from the pool.
- kv_both: The instance both stores and loads KV caches. Use this for single-node CPU offloading or prefiller instances.
kv_connector_extra_config¶
- load_async (bool): Enable asynchronous loading for better compute-I/O overlap. Default: true.
- enable_cross_layers_blocks (bool): Enable cross-layer block packing for reduced store operations. Default: false.
- discard_partial_chunks (bool): Discard partial block chunks during store. Default: true.
- lookup_rpc_port (int): Custom port for the ZMQ lookup RPC socket. Default: 0.
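The options above go inside kv_connector_extra_config in the --kv-transfer-config JSON. The sketch below spells out all four at their documented defaults so each one is easy to flip; the KV_TRANSFER_CONFIG shell variable is just an illustrative name for holding the JSON string.

```shell
# All four options are set to their documented defaults here.
KV_TRANSFER_CONFIG='{
  "kv_connector": "MooncakeStoreConnector",
  "kv_role": "kv_both",
  "kv_connector_extra_config": {
    "load_async": true,
    "enable_cross_layers_blocks": false,
    "discard_partial_chunks": true,
    "lookup_rpc_port": 0
  }
}'

# Print the JSON for inspection before handing it to vLLM.
echo "$KV_TRANSFER_CONFIG"

# The string can then be passed to vLLM, for example:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct --kv-transfer-config "$KV_TRANSFER_CONFIG"
```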
Notes¶
Reproducible Block Hashes Across Processes¶
The MooncakeStoreConnector relies on consistent block hashes across all vLLM processes sharing the distributed store. Because Python randomizes its hash seed per process by default, identical prompts can produce different block hashes on different processes, which prevents cross-process prefix cache hits.
Set a fixed PYTHONHASHSEED on every instance that shares the store (DP ranks, separate prefiller/decoder nodes, and any other vLLM process pointed at the same Mooncake store):
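A minimal sketch: any fixed value works as long as it is identical on every process; 0 is used here only as an example. The optional check uses Python's built-in str hash as a stand-in to show that two separate interpreter processes now agree; it does not exercise vLLM's own block hashing.

```shell
# Must be the same value on every vLLM process that shares the store.
export PYTHONHASHSEED=0

# Optional sanity check (skipped if python3 is not on PATH): with a fixed
# seed, two separate interpreter processes produce the same str hash.
if command -v python3 >/dev/null 2>&1; then
  first=$(python3 -c 'print(hash("mooncake"))')
  second=$(python3 -c 'print(hash("mooncake"))')
  if [ "$first" = "$second" ]; then
    echo "str hashes match across processes"
  else
    echo "str hashes differ; check PYTHONHASHSEED" >&2
  fi
fi
```

Export the variable in the environment of each vllm serve process (or in the service unit or container spec) so that it is in effect before the Python interpreter starts.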