vllm.v1.kv_offload.tiering.base

Abstract interfaces and data types for the secondary tiering layer.

JobMetadata dataclass

Metadata for an in-flight async transfer job.

Source code in vllm/v1/kv_offload/tiering/base.py
@dataclass
class JobMetadata:
    """Metadata for an in-flight async transfer job."""

    job_id: JobId
    keys: Collection[OffloadKey]
    block_ids: np.ndarray
    is_promotion: bool
    req_context: ReqContext

JobResult dataclass

Result of an async transfer job (successful or failed).

Source code in vllm/v1/kv_offload/tiering/base.py
@dataclass
class JobResult:
    """Result of an async transfer job (successful or failed)."""

    job_id: JobId
    success: bool
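
As an illustration of how the two dataclasses fit together, the sketch below matches `JobResult` objects back to the `JobMetadata` they belong to via `job_id`. The type aliases (`JobId`, `OffloadKey`, `ReqContext`) and the `resolve` helper are hypothetical stand-ins, not vLLM's definitions, and `block_ids` is a plain list here instead of an `np.ndarray` so the example needs only the standard library.

```python
from collections.abc import Collection, Sequence
from dataclasses import dataclass

# Hypothetical stand-ins; the real aliases are defined in vLLM and may differ.
JobId = int
OffloadKey = bytes
ReqContext = dict


@dataclass
class JobMetadata:
    job_id: JobId
    keys: Collection[OffloadKey]
    block_ids: Sequence[int]  # np.ndarray in the documented dataclass
    is_promotion: bool
    req_context: ReqContext


@dataclass
class JobResult:
    job_id: JobId
    success: bool


def resolve(pending: dict, results: list) -> tuple:
    """Split finished jobs into (succeeded, failed) and drop them from pending."""
    succeeded, failed = [], []
    for result in results:
        job = pending.pop(result.job_id, None)
        if job is None:
            continue  # unknown or already resolved job id
        (succeeded if result.success else failed).append(job)
    return succeeded, failed


pending = {
    1: JobMetadata(1, [b"k1", b"k2"], [4, 7], is_promotion=False, req_context={}),
    2: JobMetadata(2, [b"k3"], [9], is_promotion=True, req_context={}),
}
ok, bad = resolve(pending, [JobResult(1, True), JobResult(2, False)])
```

Because `JobResult` carries only `job_id` and `success`, whoever submits a job has to keep the metadata around until the result arrives; the `pending` dictionary plays that role in this sketch.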

SecondaryTierManager

Bases: ABC

Abstract interface for managing a single non-primary offloading tier.

Secondary tiers cannot directly access GPU memory. All data transfers must go through the CPU (primary) tier:
  - Store: GPU → CPU (primary) → secondary (cascade)
  - Load: secondary → CPU (primary) → GPU (promotion)

IMPORTANT: All methods run in the Scheduler process and must be lightweight and non-blocking. submit_load() and submit_store() submit async jobs; get_finished() polls for completion.

Source code in vllm/v1/kv_offload/tiering/base.py
class SecondaryTierManager(ABC):
    """
    Abstract interface for managing a single non-primary offloading tier.

    Secondary tiers cannot directly access GPU memory. All data transfers
    must go through the CPU (primary) tier:
      - Store: GPU → CPU (primary) → secondary  (cascade)
      - Load:  secondary → CPU (primary) → GPU  (promotion)

    IMPORTANT: All methods run in the Scheduler process and must be
    lightweight and non-blocking. submit_load() and submit_store() submit
    async jobs; get_finished() polls for completion.
    """

    def __init__(
        self,
        offloading_spec: "OffloadingSpec",
        primary_kv_view: memoryview,
        tier_type: str,
    ) -> None:
        """
        Args:
            offloading_spec: Offloading configuration.
            primary_kv_view: Memoryview of the primary tier's CPU KV cache.
            tier_type: Tier type identifier, set by SecondaryTierFactory
                from the registered tier type.
        """
        self._offloading_spec = offloading_spec
        self._primary_kv_view: memoryview = primary_kv_view
        self.tier_type = tier_type

    @abstractmethod
    def lookup(self, key: OffloadKey, req_context: ReqContext) -> bool | None:
        """
        Check whether a block exists in this secondary tier.

        Args:
            key: Offload key to look up.
            req_context: per-request context (e.g. kv_transfer_params).

        Returns:
            True if the block is present and ready,
            False if not found,
            or None if the block is being transferred (retry later).
        """
        pass

    @abstractmethod
    def submit_store(self, job_metadata: JobMetadata) -> None:
        """
        Submit an async job to store blocks from the primary tier to this
        secondary tier.

        This method must be lightweight and non-blocking: allocate metadata
        and submit the transfer, but do NOT perform the data copy on the
        calling thread.

        Preconditions (guaranteed by the framework):
          - ``job_metadata.block_ids`` are valid primary-tier slots, pinned
            (ref-counted) for the duration of the transfer.

        The implementation is responsible for:
          1. Filtering out blocks already present in this tier
          2. Evicting blocks if capacity is needed
          3. Allocating space in this tier
          4. Submitting the async transfer (read from primary via block_ids)

        Report completion via ``get_finished()``.

        Args:
            job_metadata: Job metadata including job_id, keys, and block_ids
                          identifying the primary-tier slots to read from.
        """
        pass

    @abstractmethod
    def submit_load(self, job_metadata: JobMetadata) -> None:
        """
        Submit an async job to load blocks from this secondary tier to the
        primary tier.

        This method must be lightweight and non-blocking: mark blocks as
        in-flight and submit the transfer, but do NOT perform the data copy
        on the calling thread.

        Preconditions (guaranteed by the framework):
          - ``job_metadata.block_ids`` are allocated primary-tier slots
            ready to receive data.

        The implementation must copy data from this tier into the
        primary-tier slots identified by ``block_ids``.

        Report completion via ``get_finished()``.

        Args:
            job_metadata: Job metadata including job_id, keys, and block_ids
                          identifying the primary-tier slots to write into.
        """
        pass

    @abstractmethod
    def get_finished(self) -> Iterable[JobResult]:
        """
        Return all jobs (loads and stores) that completed since the last call.

        The framework uses these results to release resources and finalize
        transfers.

        Returns:
            Iterable of JobResult objects for jobs finished since the
            last call.
        """
        pass

    def touch(self, keys: Collection[OffloadKey], req_context: ReqContext):
        """
        Mark blocks as recently used for eviction policy.

        Args:
            keys: Offload keys to mark as recently used.
            req_context: Per-request context.
        """
        return

    def shutdown(self) -> None:
        """Release resources held by this tier (threads, connections, etc.)."""
        return

__init__

__init__(
    offloading_spec: OffloadingSpec,
    primary_kv_view: memoryview,
    tier_type: str,
) -> None

Parameters:

Name Type Description Default
offloading_spec OffloadingSpec

Offloading configuration.

required
primary_kv_view memoryview

Memoryview of the primary tier's CPU KV cache.

required
tier_type str

Tier type identifier, set by SecondaryTierFactory from the registered tier type.

required
Source code in vllm/v1/kv_offload/tiering/base.py
def __init__(
    self,
    offloading_spec: "OffloadingSpec",
    primary_kv_view: memoryview,
    tier_type: str,
) -> None:
    """
    Args:
        offloading_spec: Offloading configuration.
        primary_kv_view: Memoryview of the primary tier's CPU KV cache.
        tier_type: Tier type identifier, set by SecondaryTierFactory
            from the registered tier type.
    """
    self._offloading_spec = offloading_spec
    self._primary_kv_view: memoryview = primary_kv_view
    self.tier_type = tier_type

get_finished abstractmethod

get_finished() -> Iterable[JobResult]

Return all jobs (loads and stores) that completed since the last call.

The framework uses these results to release resources and finalize transfers.

Returns:

Type Description
Iterable[JobResult]

Iterable of JobResult objects for jobs finished since the last call.

Source code in vllm/v1/kv_offload/tiering/base.py
@abstractmethod
def get_finished(self) -> Iterable[JobResult]:
    """
    Return all jobs (loads and stores) that completed since the last call.

    The framework uses these results to release resources and finalize
    transfers.

    Returns:
        Iterable of JobResult objects for jobs finished since the
        last call.
    """
    pass
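
One way to satisfy the "since the last call" contract is to have worker threads push results onto a thread-safe queue that `get_finished()` drains without blocking. The sketch below assumes that approach; `FinishedJobs` and `report` are hypothetical names, and `job_id` is assumed to be int-like.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass
class JobResult:  # mirrors the dataclass documented above
    job_id: int  # JobId is assumed to be int-like here
    success: bool


class FinishedJobs:
    """Hypothetical helper: thread-safe hand-off of results to the scheduler."""

    def __init__(self) -> None:
        self._done: queue.SimpleQueue = queue.SimpleQueue()

    def report(self, job_id: int, success: bool) -> None:
        # Called from a worker thread when a transfer ends.
        self._done.put(JobResult(job_id, success))

    def get_finished(self) -> list:
        # Drain without blocking; each result is handed out exactly once.
        results = []
        while True:
            try:
                results.append(self._done.get_nowait())
            except queue.Empty:
                return results


finished = FinishedJobs()
workers = [
    threading.Thread(target=finished.report, args=(job_id, job_id != 2))
    for job_id in (1, 2, 3)
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()

first = finished.get_finished()   # all three results, in completion order
second = finished.get_finished()  # nothing new since the previous call
```

Draining with `get_nowait()` keeps the call non-blocking, which matches the requirement that methods running in the Scheduler process stay lightweight.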

lookup abstractmethod

lookup(
    key: OffloadKey, req_context: ReqContext
) -> bool | None

Check whether a block exists in this secondary tier.

Parameters:

Name Type Description Default
key OffloadKey

Offload key to look up.

required
req_context ReqContext

Per-request context (e.g. kv_transfer_params).

required

Returns:

Type Description
bool | None

True if the block is present and ready, False if not found, or None if the block is being transferred (retry later).

Source code in vllm/v1/kv_offload/tiering/base.py
@abstractmethod
def lookup(self, key: OffloadKey, req_context: ReqContext) -> bool | None:
    """
    Check whether a block exists in this secondary tier.

    Args:
        key: Offload key to look up.
        req_context: per-request context (e.g. kv_transfer_params).

    Returns:
        True if the block is present and ready,
        False if not found,
        or None if the block is being transferred (retry later).
    """
    pass
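
The three-valued return is easy to mishandle, because `None` and `False` are both falsy. The sketch below shows one possible way to produce and consume the result; `BlockState`, the module-level `lookup` function, and `describe` are hypothetical, and the `req_context` argument is left out for brevity.

```python
from enum import Enum
from typing import Optional


class BlockState(Enum):  # hypothetical bookkeeping kept by a tier
    IN_FLIGHT = 1
    READY = 2


def lookup(index: dict, key: bytes) -> Optional[bool]:
    state = index.get(key)
    if state is None:
        return False  # not found
    if state is BlockState.IN_FLIGHT:
        return None  # transfer pending, retry later
    return True  # present and ready


def describe(result: Optional[bool]) -> str:
    # Test for None explicitly: `not result` would treat "retry" like "miss".
    if result is None:
        return "retry"
    return "hit" if result else "miss"


index = {b"k1": BlockState.READY, b"k2": BlockState.IN_FLIGHT}
outcomes = [describe(lookup(index, k)) for k in (b"k1", b"k2", b"k3")]
```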

shutdown

shutdown() -> None

Release resources held by this tier (threads, connections, etc.).

Source code in vllm/v1/kv_offload/tiering/base.py
def shutdown(self) -> None:
    """Release resources held by this tier (threads, connections, etc.)."""
    return
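
The base implementation does nothing, so a tier that owns threads or connections would override it. A minimal sketch, assuming the tier runs its transfers on a `ThreadPoolExecutor` (the class name and the `_closed` flag are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor


class WorkerBackedTier:
    """Hypothetical tier fragment that owns a thread pool for transfers."""

    def __init__(self) -> None:
        self._executor = ThreadPoolExecutor(max_workers=2)
        self._closed = False

    def shutdown(self) -> None:
        if self._closed:
            return  # safe to call more than once
        self._closed = True
        # Waits for already-submitted transfers to finish before returning.
        self._executor.shutdown(wait=True)


tier = WorkerBackedTier()
future = tier._executor.submit(sum, [1, 2, 3])  # stands in for a transfer
tier.shutdown()
tier.shutdown()  # second call is a no-op
```

Whether `shutdown()` may block is not stated in the interface. If it must return quickly, `shutdown(wait=False, cancel_futures=True)` (Python 3.9+) is an alternative that drops queued work instead of waiting for it.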

submit_load abstractmethod

submit_load(job_metadata: JobMetadata) -> None

Submit an async job to load blocks from this secondary tier to the primary tier.

This method must be lightweight and non-blocking: mark blocks as in-flight and submit the transfer, but do NOT perform the data copy on the calling thread.

Preconditions (guaranteed by the framework):
  - job_metadata.block_ids are allocated primary-tier slots ready to receive data.

The implementation must copy data from this tier into the primary-tier slots identified by block_ids.

Report completion via get_finished().

Parameters:

Name Type Description Default
job_metadata JobMetadata

Job metadata including job_id, keys, and block_ids identifying the primary-tier slots to write into.

required
Source code in vllm/v1/kv_offload/tiering/base.py
@abstractmethod
def submit_load(self, job_metadata: JobMetadata) -> None:
    """
    Submit an async job to load blocks from this secondary tier to the
    primary tier.

    This method must be lightweight and non-blocking: mark blocks as
    in-flight and submit the transfer, but do NOT perform the data copy
    on the calling thread.

    Preconditions (guaranteed by the framework):
      - ``job_metadata.block_ids`` are allocated primary-tier slots
        ready to receive data.

    The implementation must copy data from this tier into the
    primary-tier slots identified by ``block_ids``.

    Report completion via ``get_finished()``.

    Args:
        job_metadata: Job metadata including job_id, keys, and block_ids
                      identifying the primary-tier slots to write into.
    """
    pass
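
The load path can be sketched as follows: the calling thread only records the job and queues the work, and a worker thread writes into the primary-tier buffer through the memoryview. Several things here are assumptions rather than vLLM behavior: the fixed `BLOCK_SIZE`, the pairing of `keys[i]` with `block_ids[i]`, the trimmed `JobMetadata` stand-in, and the in-memory `stored` dictionary that plays the role of the secondary tier.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

BLOCK_SIZE = 4  # hypothetical: bytes per primary-tier slot


@dataclass
class JobMetadata:  # trimmed stand-in for the documented dataclass
    job_id: int
    keys: list
    block_ids: list  # np.ndarray in the documented dataclass


@dataclass
class JobResult:
    job_id: int
    success: bool


class LoadSketch:
    def __init__(self, primary_kv_view: memoryview, stored: dict) -> None:
        self._primary = primary_kv_view
        self._stored = stored  # key -> bytes held by this tier
        self._in_flight: dict = {}  # job_id -> Future of the pending copy
        self._executor = ThreadPoolExecutor(max_workers=1)

    def submit_load(self, job: JobMetadata) -> None:
        # Only bookkeeping and a queue push happen on the calling thread.
        pairs = list(zip(job.keys, job.block_ids))
        self._in_flight[job.job_id] = self._executor.submit(self._copy, pairs)

    def _copy(self, pairs: list) -> None:
        # Runs on a worker thread: write each block into its primary slot.
        for key, slot in pairs:
            start = int(slot) * BLOCK_SIZE
            self._primary[start:start + BLOCK_SIZE] = self._stored[key]

    def get_finished(self) -> list:
        done = [jid for jid, fut in self._in_flight.items() if fut.done()]
        return [
            JobResult(jid, self._in_flight.pop(jid).exception() is None)
            for jid in done
        ]


primary = bytearray(3 * BLOCK_SIZE)
tier = LoadSketch(memoryview(primary), {b"a": b"AAAA", b"b": b"BBBB"})
tier.submit_load(JobMetadata(job_id=7, keys=[b"a", b"b"], block_ids=[2, 0]))
tier.submit_load(JobMetadata(job_id=8, keys=[b"missing"], block_ids=[1]))
tier._executor.shutdown(wait=True)  # demo only: makes the outcome deterministic
results = tier.get_finished()
```

A failed copy (here, the missing key in job 8) surfaces as `success=False` rather than as an exception on the calling thread, which is how `JobResult` is described above.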

submit_store abstractmethod

submit_store(job_metadata: JobMetadata) -> None

Submit an async job to store blocks from the primary tier to this secondary tier.

This method must be lightweight and non-blocking: allocate metadata and submit the transfer, but do NOT perform the data copy on the calling thread.

Preconditions (guaranteed by the framework):
  - job_metadata.block_ids are valid primary-tier slots, pinned (ref-counted) for the duration of the transfer.

The implementation is responsible for:
  1. Filtering out blocks already present in this tier
  2. Evicting blocks if capacity is needed
  3. Allocating space in this tier
  4. Submitting the async transfer (read from primary via block_ids)

Report completion via get_finished().

Parameters:

Name Type Description Default
job_metadata JobMetadata

Job metadata including job_id, keys, and block_ids identifying the primary-tier slots to read from.

required
Source code in vllm/v1/kv_offload/tiering/base.py
@abstractmethod
def submit_store(self, job_metadata: JobMetadata) -> None:
    """
    Submit an async job to store blocks from the primary tier to this
    secondary tier.

    This method must be lightweight and non-blocking: allocate metadata
    and submit the transfer, but do NOT perform the data copy on the
    calling thread.

    Preconditions (guaranteed by the framework):
      - ``job_metadata.block_ids`` are valid primary-tier slots, pinned
        (ref-counted) for the duration of the transfer.

    The implementation is responsible for:
      1. Filtering out blocks already present in this tier
      2. Evicting blocks if capacity is needed
      3. Allocating space in this tier
      4. Submitting the async transfer (read from primary via block_ids)

    Report completion via ``get_finished()``.

    Args:
        job_metadata: Job metadata including job_id, keys, and block_ids
                      identifying the primary-tier slots to read from.
    """
    pass
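
The four responsibilities listed for `submit_store()` can be sketched with an in-memory tier. This is an illustration under stated assumptions, not the vLLM implementation: `BLOCK_SIZE`, the oldest-first eviction order, the pairing of `keys[i]` with `block_ids[i]`, and the `StoreSketch` class are all hypothetical. The sketch also assumes a job fits within capacity and does not protect reserved blocks from eviction.

```python
from collections import OrderedDict
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

BLOCK_SIZE = 4  # hypothetical: bytes per primary-tier slot


@dataclass
class JobMetadata:  # trimmed stand-in for the documented dataclass
    job_id: int
    keys: list
    block_ids: list  # np.ndarray in the documented dataclass


@dataclass
class JobResult:
    job_id: int
    success: bool


class StoreSketch:
    def __init__(self, primary_kv_view: memoryview, capacity: int) -> None:
        self._primary = primary_kv_view
        self._capacity = capacity
        self._blocks: OrderedDict = OrderedDict()  # key -> bytes, oldest first
        self._futures: dict = {}
        self._executor = ThreadPoolExecutor(max_workers=1)

    def submit_store(self, job: JobMetadata) -> None:
        # 1. Filter out blocks already present in this tier.
        todo = [(k, int(b)) for k, b in zip(job.keys, job.block_ids)
                if k not in self._blocks]
        # 2. Evict the oldest blocks if capacity is needed.
        while self._blocks and len(self._blocks) + len(todo) > self._capacity:
            self._blocks.popitem(last=False)
        # 3. Allocate space (None marks a reserved, not yet filled entry).
        for key, _ in todo:
            self._blocks[key] = None
        # 4. Submit the async read from the primary tier.
        self._futures[job.job_id] = self._executor.submit(self._read, todo)

    def _read(self, todo: list) -> list:
        # Worker thread: copy bytes out of the pinned primary-tier slots.
        return [(k, bytes(self._primary[s * BLOCK_SIZE:(s + 1) * BLOCK_SIZE]))
                for k, s in todo]

    def get_finished(self) -> list:
        results = []
        for job_id in [j for j, f in self._futures.items() if f.done()]:
            future = self._futures.pop(job_id)
            ok = future.exception() is None
            if ok:  # commit on the calling thread so the index has one writer
                for key, data in future.result():
                    if key in self._blocks:
                        self._blocks[key] = data
            results.append(JobResult(job_id, ok))
        return results


tier = StoreSketch(memoryview(bytearray(b"xxxxNEW!")), capacity=2)
tier._blocks[b"old1"] = b"1111"
tier._blocks[b"old2"] = b"2222"
tier.submit_store(JobMetadata(job_id=3, keys=[b"old2", b"new"], block_ids=[0, 1]))
tier._executor.shutdown(wait=True)  # demo only: makes the outcome deterministic
results = tier.get_finished()
```

Committing the copied data inside `get_finished()` keeps every mutation of the index on the calling thread, so the worker never touches shared bookkeeping.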

touch

touch(
    keys: Collection[OffloadKey], req_context: ReqContext
)

Mark blocks as recently used for eviction policy.

Parameters:

Name Type Description Default
keys Collection[OffloadKey]

Offload keys to mark as recently used.

required
req_context ReqContext

Per-request context.

required
Source code in vllm/v1/kv_offload/tiering/base.py
def touch(self, keys: Collection[OffloadKey], req_context: ReqContext):
    """
    Mark blocks as recently used for eviction policy.

    Args:
        keys: Offload keys to mark as recently used.
        req_context: Per-request context.
    """
    return
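
The default `touch()` is a no-op. A tier with a least-recently-used eviction policy could override it along these lines; the `LruIndex` class and its `evict_one` helper are hypothetical, and `req_context` is accepted but unused in this sketch.

```python
from collections import OrderedDict


class LruIndex:
    """Hypothetical recency tracker: least recently used key comes first."""

    def __init__(self, keys) -> None:
        self._order: OrderedDict = OrderedDict((k, None) for k in keys)

    def touch(self, keys, req_context=None) -> None:
        for key in keys:
            if key in self._order:  # ignore keys this tier does not hold
                self._order.move_to_end(key)

    def evict_one(self):
        # Remove and return the least recently used key.
        key, _ = self._order.popitem(last=False)
        return key


index = LruIndex([b"a", b"b", b"c"])
index.touch([b"a", b"zz"])  # b"zz" is unknown and is skipped
victim = index.evict_one()
```

After touching `b"a"`, the least recently used key is `b"b"`, so that is the block the sketch evicts first.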