MooncakeStoreConnector Usage Guide

MooncakeStoreConnector is a KV cache connector that uses MooncakeDistributedStore as a shared KV cache pool. Unlike MooncakeConnector, which transfers KV cache directly point-to-point between prefiller and decoder, MooncakeStoreConnector offloads KV cache to an external distributed store. It supports:

  • CPU/disk offloading: Extend effective KV cache capacity by offloading to CPU memory or disk via Mooncake's transfer engine.
  • Prefix caching across instances: Hash-based deduplication allows multiple vLLM instances to share cached KV blocks through the store.
  • Single-node and multi-node deployment: Works both as a standalone KV cache extension and in disaggregated prefill-decode setups.

Prerequisites

Install Mooncake

Install Mooncake with pip (shown here via uv):

uv pip install mooncake-transfer-engine

Refer to the Mooncake official repository for more installation instructions and building from source.

Start the Mooncake Master Server

The Mooncake master manages metadata and coordinates the distributed store. Start it before launching vLLM:

mooncake_master --port 50051

Default ports:

  • RPC: 50051

Multiple vLLM instances can share the same master server.
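
Because the master must be running before vLLM starts, a launch script may want to wait for the RPC port first. The following is a minimal sketch, not part of Mooncake or vLLM: wait_for_port is a hypothetical helper, and a successful TCP connect only shows that something is listening on the port, not that the master is healthy.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 30.0,
                  interval_s: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # create_connection raises OSError if the port is closed or unreachable.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

For the setup in this guide, a script would call wait_for_port("127.0.0.1", 50051) and launch vllm serve only if it returns True.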

Configure Mooncake

Create a JSON configuration file (e.g., mooncake_config.json):

{
  "mode": "embedded",
  "metadata_server": "P2PHANDSHAKE",
  "master_server_address": "127.0.0.1:50051",
  "global_segment_size": "80GB",
  "local_buffer_size": "4GB",
  "protocol": "rdma",
  "device_name": "",
  "enable_offload": false
}

  • mode: Topology selection. In "embedded" mode (the default, PR-40900 baseline), each vLLM rank contributes global_segment_size to the pool in-process. In "standalone-store" mode, ranks are pure requesters: an external mooncake_client process owns the CPU pool and, optionally, the SSD tier.
  • protocol: Use "rdma" for best performance. "tcp" works as a fallback.
  • global_segment_size: CPU memory contributed to the distributed pool (per GPU). Must be > 0 in embedded mode and 0 in standalone-store mode.
  • local_buffer_size: Private buffer for this node's own operations (per GPU).
  • enable_offload: When true, vLLM allocates a DirectIO staging buffer so large prefills do not exceed the owner's SSD-write budget. Set this together with the matching --enable_offload=true flag on mooncake_master and on the external mooncake_client (if any).

Set the config path via environment variable:

export MOONCAKE_CONFIG_PATH=/path/to/mooncake_config.json
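
The rule that global_segment_size must be positive in embedded mode and 0 in standalone-store mode is easy to check before launch. The sketch below is an illustration only: parse_size, check_config, and check_config_file are hypothetical helpers, and the assumption that "GB" means 1024**3 bytes may not match how the connector parses size strings.

```python
import json
import os
import re
from typing import Optional

# Assumption: size suffixes are binary multiples (1 GB = 1024**3 bytes).
_UNITS = {"": 1, "B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def parse_size(value) -> int:
    """Convert 0, 1024, "4GB", or "512 MB" to a byte count."""
    if isinstance(value, int):
        return value
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]*)\s*", str(value))
    if match is None or match.group(2).upper() not in _UNITS:
        raise ValueError(f"unrecognized size: {value!r}")
    return int(match.group(1)) * _UNITS[match.group(2).upper()]


def check_config(cfg: dict) -> list:
    """Return a list of problems; an empty list means the documented rules hold."""
    problems = []
    mode = cfg.get("mode", "embedded")  # "embedded" is the documented default
    segment = parse_size(cfg.get("global_segment_size", 0))
    if mode == "embedded" and segment <= 0:
        problems.append("embedded mode requires global_segment_size > 0")
    elif mode == "standalone-store" and segment != 0:
        problems.append("standalone-store mode requires global_segment_size == 0")
    elif mode not in ("embedded", "standalone-store"):
        problems.append(f"unknown mode: {mode!r}")
    return problems


def check_config_file(path: Optional[str] = None) -> list:
    """Load the JSON file at path (or at MOONCAKE_CONFIG_PATH) and check it."""
    path = path or os.environ["MOONCAKE_CONFIG_PATH"]
    with open(path, encoding="utf-8") as f:
        return check_config(json.load(f))
```

Calling check_config_file() with MOONCAKE_CONFIG_PATH exported returns an empty list when the mode and segment size agree.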

Usage

Single-Node KV Cache Offloading

Use MooncakeStoreConnector to offload KV cache to CPU memory, extending the effective cache size:

MOONCAKE_CONFIG_PATH=mooncake_config.json \
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --kv-transfer-config '{"kv_connector":"MooncakeStoreConnector","kv_role":"kv_both"}'

Disaggregated Prefill-Decode (XpYd)

In disaggregated prefill-decode mode, use MultiConnector to combine MooncakeConnector (point-to-point KV transfer) with MooncakeStoreConnector (shared KV cache pool). This enables both direct P2P transfer between prefiller and decoder, and cross-instance prefix cache sharing via the distributed store.

Prefiller Node:

MOONCAKE_CONFIG_PATH=mooncake_config.json \
VLLM_MOONCAKE_BOOTSTRAP_PORT=50052 \
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --port 8100 \
    --kv-transfer-config '{
        "kv_connector": "MultiConnector",
        "kv_role": "kv_producer",
        "kv_connector_extra_config": {
            "connectors": [
                {
                    "kv_connector": "MooncakeConnector",
                    "kv_role": "kv_producer"
                },
                {
                    "kv_connector": "MooncakeStoreConnector",
                    "kv_role": "kv_both"
                }
            ]
        }
    }'

Decoder Node:

MOONCAKE_CONFIG_PATH=mooncake_config.json \
VLLM_MOONCAKE_BOOTSTRAP_PORT=50053 \
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --port 8200 \
    --kv-transfer-config '{
        "kv_connector": "MultiConnector",
        "kv_role": "kv_consumer",
        "kv_connector_extra_config": {
            "connectors": [
                {
                    "kv_connector": "MooncakeConnector",
                    "kv_role": "kv_consumer"
                },
                {
                    "kv_connector": "MooncakeStoreConnector",
                    "kv_role": "kv_consumer"
                }
            ]
        }
    }'
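
Hand-written nested JSON inside shell quotes is easy to get wrong. As a sketch, the two configurations above can be generated from one function and quoted for the shell; multi_connector_config and as_cli_arg are hypothetical helper names, and the roles are the ones used in the commands above.

```python
import json
import shlex


def multi_connector_config(role: str, store_role: str) -> dict:
    """Build a MultiConnector config: P2P connector uses `role`, store uses `store_role`."""
    return {
        "kv_connector": "MultiConnector",
        "kv_role": role,
        "kv_connector_extra_config": {
            "connectors": [
                {"kv_connector": "MooncakeConnector", "kv_role": role},
                {"kv_connector": "MooncakeStoreConnector", "kv_role": store_role},
            ]
        },
    }


def as_cli_arg(config: dict) -> str:
    """Render the config as a shell-safe --kv-transfer-config argument."""
    return "--kv-transfer-config " + shlex.quote(json.dumps(config))


# Prefiller: produces over P2P, and both stores to and loads from the pool.
prefiller = multi_connector_config("kv_producer", store_role="kv_both")
# Decoder: consumes over P2P and only loads from the pool.
decoder = multi_connector_config("kv_consumer", store_role="kv_consumer")
```

Printing as_cli_arg(prefiller) yields a string that can be appended to the vllm serve command line.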

Proxy:

A disaggregation proxy is required to route requests between prefiller and decoder nodes. The proxy assigns do_remote_prefill=True / do_remote_decode=True to coordinate P2P transfer via MooncakeConnector. Refer to the MooncakeConnector usage guide for proxy setup details.

Disk Offloading

Disk offloading is most commonly run in standalone-store mode: an external mooncake_client process owns the CPU pool and the SSD tier, and each vLLM rank is a pure requester. This avoids per-rank duplication of the SSD pool and keeps DirectIO budget tracking in a single process.

Three things need to be aligned for end-to-end disk offloading:

  1. mooncake_master is started with --enable_offload=true.
  2. mooncake_client (the owner) is started with --enable_offload=true plus an SSD path via MOONCAKE_OFFLOAD_FILE_STORAGE_PATH.
  3. The vLLM side sets "enable_offload": true in the JSON config file (the connector reads this setting from the file; it is not an environment variable).

Example mooncake_config.json for the vLLM side:

{
  "mode": "standalone-store",
  "metadata_server": "P2PHANDSHAKE",
  "master_server_address": "127.0.0.1:50051",
  "global_segment_size": 0,
  "local_buffer_size": "4GB",
  "protocol": "rdma",
  "device_name": "mlx5_0",
  "enable_offload": true
}

Steer this rank to the local owner segment with:

export MOONCAKE_PREFERRED_SEGMENT=127.0.0.1:50053

The owner's SSD directory, on-disk eviction policy, and the DirectIO staging buffer size are controlled on the mooncake_client side via the standard Mooncake environment variables (MOONCAKE_OFFLOAD_FILE_STORAGE_PATH, MOONCAKE_BUCKET_EVICTION_POLICY, MOONCAKE_USE_URING, MOONCAKE_OFFLOAD_LOCAL_BUFFER_SIZE_BYTES, MOONCAKE_OFFLOAD_TOTAL_SIZE_LIMIT_BYTES, etc.). Those are independent of the vLLM JSON config.

Environment Variables

| Variable | Description | Default |
| --- | --- | --- |
| MOONCAKE_CONFIG_PATH | Path to Mooncake JSON config file | (required) |
| VLLM_MOONCAKE_BOOTSTRAP_PORT | Bootstrap port for MooncakeConnector P2P transfer (disagg mode only) | 8998 |
| MOONCAKE_PREFERRED_SEGMENT | Pin this rank's replicas to a specific owner segment (host:port); used in standalone-store mode | |
| MOONCAKE_REQUESTER_LOCAL_HOSTNAME | Override the hostname the vLLM rank registers with Mooncake as a requester. Defaults to the rank's resolved IP. | |
| VLLM_MOONCAKE_STORE_TIER_LOG | When 1, logs a per-batch tier summary (memory vs disk hits) for observability | disabled |
| VLLM_MOONCAKE_DISK_STAGING_USABLE_RATIO | Fraction of the owner's DirectIO staging buffer that the requester will fill in a single batch_get_into_multi_buffers call. Lower → more conservative pre-split, more round trips. | 0.9 |
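
To make the effect of VLLM_MOONCAKE_DISK_STAGING_USABLE_RATIO concrete, here is a small sketch of a greedy pre-split. It is an illustration of the trade-off only, not the connector's actual splitting logic; split_by_staging_budget and its parameters are hypothetical.

```python
def split_by_staging_budget(sizes, staging_bytes, usable_ratio=0.9):
    """Greedily group item indices so each group's total size fits the usable budget."""
    if not 0.0 < usable_ratio <= 1.0:
        raise ValueError("usable_ratio must be in (0, 1]")
    budget = int(staging_bytes * usable_ratio)
    batches, current, used = [], [], 0
    for index, size in enumerate(sizes):
        if size > budget:
            raise ValueError(f"item {index} ({size} bytes) exceeds budget {budget}")
        # Start a new batch when adding this item would overflow the budget.
        if current and used + size > budget:
            batches.append(current)
            current, used = [], 0
        current.append(index)
        used += size
    if current:
        batches.append(current)
    return batches
```

With three 40-byte items and a 100-byte staging buffer, a ratio of 0.9 gives two batches, while a ratio of 0.5 gives three: a lower ratio leaves more headroom per call but needs more round trips.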

KV Transfer Config

KV Role Options

  • kv_producer: For instances that store KV caches to the pool.
  • kv_consumer: For instances that load KV caches from the pool.
  • kv_both: The instance both stores and loads KV caches. Use this for single-node CPU offloading or prefiller instances.

kv_connector_extra_config

  • load_async (bool): Enable asynchronous loading for better compute-I/O overlap. Default: true.
  • enable_cross_layers_blocks (bool): Enable cross-layer block packing for reduced store operations. Default: false.
  • discard_partial_chunks (bool): Discard partial block chunks during store. Default: true.
  • lookup_rpc_port (int): Custom port for the ZMQ lookup RPC socket. Default: 0.
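
As a sketch of how these options fit into a single-node configuration, the snippet below builds the --kv-transfer-config value with every option set explicitly to its documented default. It assumes the options are passed under the top-level kv_connector_extra_config key when MooncakeStoreConnector is used on its own.

```python
import json

kv_transfer_config = {
    "kv_connector": "MooncakeStoreConnector",
    "kv_role": "kv_both",
    # Assumption: connector options live under this top-level key.
    "kv_connector_extra_config": {
        "load_async": True,                   # documented default
        "enable_cross_layers_blocks": False,  # documented default
        "discard_partial_chunks": True,       # documented default
        "lookup_rpc_port": 0,                 # documented default
    },
}

# json.dumps converts Python True/False to JSON true/false.
kv_transfer_json = json.dumps(kv_transfer_config)
```

The resulting kv_transfer_json string is what would be passed, single-quoted, after --kv-transfer-config.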

Notes

Reproducible Block Hashes Across Processes

The MooncakeStoreConnector relies on consistent block hashes across all vLLM processes sharing the distributed store. Because Python randomizes its hash seed per process by default, identical prompts can produce different block hashes on different processes, which prevents cross-process prefix cache hits.

Set a fixed PYTHONHASHSEED on every instance that shares the store (DP ranks, separate prefiller/decoder nodes, and any other vLLM process pointed at the same Mooncake store):

PYTHONHASHSEED=0 vllm serve ...
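
The underlying Python behavior can be observed directly. This sketch does not use vLLM's block hashing; it only shows that the built-in hash() of a string is stable across interpreter processes when PYTHONHASHSEED is fixed. str_hash_in_new_process is a hypothetical helper.

```python
import os
import subprocess
import sys
from typing import Optional


def str_hash_in_new_process(text: str, seed: Optional[str]) -> int:
    """Return hash(text) as computed by a fresh Python interpreter."""
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)  # seed=None leaves hash randomization on
    if seed is not None:
        env["PYTHONHASHSEED"] = seed
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())


# Two separate processes agree when the seed is fixed.
fixed_a = str_hash_in_new_process("shared prefix", "0")
fixed_b = str_hash_in_new_process("shared prefix", "0")
```

Calling the helper with seed=None will, in almost all cases, return a different value on each call, which is the situation the fixed seed avoids.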