vllm.v1.kv_offload.tiering.base ¶

Abstract interfaces and data types for the secondary tiering layer.

JobMetadata `dataclass` ¶

Metadata for an in-flight async transfer job.

Source code in vllm/v1/kv_offload/tiering/base.py

@dataclass
class JobMetadata:
    """Metadata for an in-flight async transfer job."""

    job_id: JobId
    keys: Collection[OffloadKey]
    block_ids: np.ndarray
    is_promotion: bool
    req_context: ReqContext

JobResult `dataclass` ¶

Result of an async transfer job (successful or failed).

Source code in vllm/v1/kv_offload/tiering/base.py

@dataclass
class JobResult:
    """Result of an async transfer job (successful or failed)."""

    job_id: JobId
    success: bool

SecondaryTierManager ¶

Bases: ABC

Abstract interface for managing a single non-primary offloading tier.

Secondary tiers cannot directly access GPU memory. All data transfers must go through the CPU (primary) tier:

- Store: GPU → CPU (primary) → secondary (cascade)
- Load: secondary → CPU (primary) → GPU (promotion)

IMPORTANT: All methods run in the Scheduler process and must be lightweight and non-blocking. submit_load() and submit_store() submit async jobs; get_finished() polls for completion.

Source code in vllm/v1/kv_offload/tiering/base.py

class SecondaryTierManager(ABC):
    """
    Abstract interface for managing a single non-primary offloading tier.

    Secondary tiers cannot directly access GPU memory. All data transfers
    must go through the CPU (primary) tier:
      - Store: GPU → CPU (primary) → secondary  (cascade)
      - Load:  secondary → CPU (primary) → GPU  (promotion)

    IMPORTANT: All methods run in the Scheduler process and must be
    lightweight and non-blocking. submit_load() and submit_store() submit
    async jobs; get_finished() polls for completion.
    """

    def __init__(
        self,
        offloading_spec: "OffloadingSpec",
        primary_kv_view: memoryview,
        tier_type: str,
    ) -> None:
        """
        Args:
            offloading_spec: Offloading configuration.
            primary_kv_view: Memoryview of the primary tier's CPU KV cache.
            tier_type: Tier type identifier, set by SecondaryTierFactory
                from the registered tier type.
        """
        self._offloading_spec = offloading_spec
        self._primary_kv_view: memoryview = primary_kv_view
        self.tier_type = tier_type

    @abstractmethod
    def lookup(self, key: OffloadKey, req_context: ReqContext) -> bool | None:
        """
        Check whether a block exists in this secondary tier.

        Args:
            key: Offload key to look up.
            req_context: per-request context (e.g. kv_transfer_params).

        Returns:
            True if the block is present and ready,
            False if not found,
            or None if the block is being transferred (retry later).
        """
        pass

    @abstractmethod
    def submit_store(self, job_metadata: JobMetadata) -> None:
        """
        Submit an async job to store blocks from the primary tier to this
        secondary tier.

        This method must be lightweight and non-blocking: allocate metadata
        and submit the transfer, but do NOT perform the data copy on the
        calling thread.

        Preconditions (guaranteed by the framework):
          - ``job_metadata.block_ids`` are valid primary-tier slots, pinned
            (ref-counted) for the duration of the transfer.

        The implementation is responsible for:
          1. Filtering out blocks already present in this tier
          2. Evicting blocks if capacity is needed
          3. Allocating space in this tier
          4. Submitting the async transfer (read from primary via block_ids)

        Report completion via ``get_finished()``.

        Args:
            job_metadata: Job metadata including job_id, keys, and block_ids
                          identifying the primary-tier slots to read from.
        """
        pass

    @abstractmethod
    def submit_load(self, job_metadata: JobMetadata) -> None:
        """
        Submit an async job to load blocks from this secondary tier to the
        primary tier.

        This method must be lightweight and non-blocking: mark blocks as
        in-flight and submit the transfer, but do NOT perform the data copy
        on the calling thread.

        Preconditions (guaranteed by the framework):
          - ``job_metadata.block_ids`` are allocated primary-tier slots
            ready to receive data.

        The implementation must copy data from this tier into the
        primary-tier slots identified by ``block_ids``.

        Report completion via ``get_finished()``.

        Args:
            job_metadata: Job metadata including job_id, keys, and block_ids
                          identifying the primary-tier slots to write into.
        """
        pass

    @abstractmethod
    def get_finished(self) -> Iterable[JobResult]:
        """
        Return all jobs (loads and stores) that completed since the last call.

        The framework uses these results to release resources and finalize
        transfers.

        Returns:
            Iterable of JobResult objects for jobs finished since the
            last call.
        """
        pass

    def touch(self, keys: Collection[OffloadKey], req_context: ReqContext):
        """
        Mark blocks as recently used for eviction policy.

        Args:
            keys: Offload keys to mark as recently used.
            req_context: Per-request context.
        """
        return

    def shutdown(self) -> None:
        """Release resources held by this tier (threads, connections, etc.)."""
        return

__init__ ¶

__init__(
    offloading_spec: OffloadingSpec,
    primary_kv_view: memoryview,
    tier_type: str,
) -> None

Parameters:

Name	Type	Description	Default
`offloading_spec`	`OffloadingSpec`	Offloading configuration.	required
`primary_kv_view`	`memoryview`	Memoryview of the primary tier's CPU KV cache.	required
`tier_type`	`str`	Tier type identifier, set by SecondaryTierFactory from the registered tier type.	required

Source code in vllm/v1/kv_offload/tiering/base.py

def __init__(
    self,
    offloading_spec: "OffloadingSpec",
    primary_kv_view: memoryview,
    tier_type: str,
) -> None:
    """
    Args:
        offloading_spec: Offloading configuration.
        primary_kv_view: Memoryview of the primary tier's CPU KV cache.
        tier_type: Tier type identifier, set by SecondaryTierFactory
            from the registered tier type.
    """
    self._offloading_spec = offloading_spec
    self._primary_kv_view: memoryview = primary_kv_view
    self.tier_type = tier_type

get_finished `abstractmethod` ¶

get_finished() -> Iterable[JobResult]

Return all jobs (loads and stores) that completed since the last call.

The framework uses these results to release resources and finalize transfers.

Returns:

Type	Description
`Iterable[JobResult]`	Iterable of JobResult objects for jobs finished since the last call.

Source code in vllm/v1/kv_offload/tiering/base.py

@abstractmethod
def get_finished(self) -> Iterable[JobResult]:
    """
    Return all jobs (loads and stores) that completed since the last call.

    The framework uses these results to release resources and finalize
    transfers.

    Returns:
        Iterable of JobResult objects for jobs finished since the
        last call.
    """
    pass
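
One way to meet the "since the last call" contract is to let the background worker push results onto a thread-safe queue and have `get_finished()` drain it without blocking. The sketch below is only an illustration under stated assumptions, not vLLM's implementation: `CompletionQueue` and `report()` are hypothetical names, `JobResult` is a local stand-in mirroring the fields documented above, and `JobId` is assumed to be an `int`.

```python
import queue
from dataclasses import dataclass


@dataclass
class JobResult:
    """Local stand-in mirroring the JobResult fields documented above."""

    job_id: int  # assumption: JobId behaves like an int
    success: bool


class CompletionQueue:
    """Hypothetical helper that collects results from worker threads."""

    def __init__(self) -> None:
        self._done = queue.SimpleQueue()

    def report(self, job_id: int, success: bool) -> None:
        # Called from the worker thread when a transfer ends.
        self._done.put(JobResult(job_id, success))

    def get_finished(self) -> list[JobResult]:
        # Drain everything queued so far without blocking the caller.
        finished = []
        while True:
            try:
                finished.append(self._done.get_nowait())
            except queue.Empty:
                return finished
```

Each result is handed out exactly once, so a second call with no new completions returns an empty list.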

lookup `abstractmethod` ¶

lookup(
    key: OffloadKey, req_context: ReqContext
) -> bool | None

Check whether a block exists in this secondary tier.

Parameters:

Name	Type	Description	Default
`key`	`OffloadKey`	Offload key to look up.	required
`req_context`	`ReqContext`	Per-request context (e.g. kv_transfer_params).	required

Returns:

Type	Description
`bool \| None`	True if the block is present and ready, False if not found, or None if the block is being transferred (retry later).

Source code in vllm/v1/kv_offload/tiering/base.py

@abstractmethod
def lookup(self, key: OffloadKey, req_context: ReqContext) -> bool | None:
    """
    Check whether a block exists in this secondary tier.

    Args:
        key: Offload key to look up.
        req_context: per-request context (e.g. kv_transfer_params).

    Returns:
        True if the block is present and ready,
        False if not found,
        or None if the block is being transferred (retry later).
    """
    pass
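
Because `lookup()` has three outcomes, callers need to test for `None` explicitly; a plain truthiness check would treat "retry later" as a miss. The following sketch shows one way to handle the result. `ToyTierIndex` and `describe()` are hypothetical names, and keys are modeled as `bytes` purely for illustration.

```python
from typing import Optional


class ToyTierIndex:
    """Hypothetical index; stands in for a tier's bookkeeping."""

    def __init__(self) -> None:
        self.ready = set()      # keys fully stored in this tier
        self.in_flight = set()  # keys whose transfer has not finished

    def lookup(self, key: bytes) -> Optional[bool]:
        if key in self.in_flight:
            return None  # being transferred: caller should retry later
        return key in self.ready


def describe(result: Optional[bool]) -> str:
    # Check None first: `if not result` would conflate None with False.
    if result is None:
        return "retry"
    return "hit" if result else "miss"
```

With `ready = {b"a"}` and `in_flight = {b"b"}`, the keys `b"a"`, `b"b"`, and `b"c"` map to "hit", "retry", and "miss" respectively.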

shutdown ¶

shutdown() -> None

Release resources held by this tier (threads, connections, etc.).

Source code in vllm/v1/kv_offload/tiering/base.py

def shutdown(self) -> None:
    """Release resources held by this tier (threads, connections, etc.)."""
    return

submit_load `abstractmethod` ¶

submit_load(job_metadata: JobMetadata) -> None

Submit an async job to load blocks from this secondary tier to the primary tier.

This method must be lightweight and non-blocking: mark blocks as in-flight and submit the transfer, but do NOT perform the data copy on the calling thread.

Preconditions (guaranteed by the framework):

- job_metadata.block_ids are allocated primary-tier slots ready to receive data.

The implementation must copy data from this tier into the primary-tier slots identified by block_ids.

Report completion via get_finished().

Parameters:

Name	Type	Description	Default
`job_metadata`	`JobMetadata`	Job metadata including job_id, keys, and block_ids identifying the primary-tier slots to write into.	required

Source code in vllm/v1/kv_offload/tiering/base.py

@abstractmethod
def submit_load(self, job_metadata: JobMetadata) -> None:
    """
    Submit an async job to load blocks from this secondary tier to the
    primary tier.

    This method must be lightweight and non-blocking: mark blocks as
    in-flight and submit the transfer, but do NOT perform the data copy
    on the calling thread.

    Preconditions (guaranteed by the framework):
      - ``job_metadata.block_ids`` are allocated primary-tier slots
        ready to receive data.

    The implementation must copy data from this tier into the
    primary-tier slots identified by ``block_ids``.

    Report completion via ``get_finished()``.

    Args:
        job_metadata: Job metadata including job_id, keys, and block_ids
                      identifying the primary-tier slots to write into.
    """
    pass
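
The load flow described above (mark in-flight, hand the copy to a worker, report through `get_finished()`) can be sketched with a thread pool and a writable `memoryview`. This is a toy under loud assumptions, not the library's code: `ToyLoadTier` is hypothetical, the method takes plain arguments instead of a `JobMetadata`, and the primary buffer is assumed to be a flat byte array of fixed-size blocks (`BLOCK_BYTES` is invented for the example).

```python
import queue
from concurrent.futures import ThreadPoolExecutor

BLOCK_BYTES = 4  # hypothetical fixed block size, in bytes


class ToyLoadTier:
    """Hypothetical in-memory tier; illustrates the load flow only."""

    def __init__(self, primary_view: memoryview, blocks: dict) -> None:
        self._primary = primary_view  # writable view of the primary buffer
        self._blocks = blocks         # key -> block bytes held by this tier
        self._in_flight = set()       # keys with a load in progress
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._done = queue.SimpleQueue()

    def lookup(self, key):
        if key in self._in_flight:
            return None
        return key in self._blocks

    def submit_load(self, job_id, keys, block_ids) -> None:
        keys, block_ids = list(keys), list(block_ids)
        self._in_flight.update(keys)  # cheap bookkeeping on the caller
        self._pool.submit(self._copy, job_id, keys, block_ids)

    def _copy(self, job_id, keys, block_ids) -> None:
        # Runs on the worker thread: the only place data is copied.
        ok = True
        try:
            for key, slot in zip(keys, block_ids):
                start = slot * BLOCK_BYTES
                self._primary[start:start + BLOCK_BYTES] = self._blocks[key]
        except Exception:
            ok = False
        self._done.put((job_id, keys, ok))

    def get_finished(self) -> list:
        results = []
        while True:
            try:
                job_id, keys, ok = self._done.get_nowait()
            except queue.Empty:
                return results
            self._in_flight.difference_update(keys)
            results.append((job_id, ok))
```

In this sketch the in-flight set is only touched by the calling thread (in `submit_load()` and `get_finished()`), so the bookkeeping needs no lock; a failed copy, such as a missing key, is reported as an unsuccessful job rather than raised.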

submit_store `abstractmethod` ¶

submit_store(job_metadata: JobMetadata) -> None

Submit an async job to store blocks from the primary tier to this secondary tier.

This method must be lightweight and non-blocking: allocate metadata and submit the transfer, but do NOT perform the data copy on the calling thread.

Preconditions (guaranteed by the framework):

- job_metadata.block_ids are valid primary-tier slots, pinned (ref-counted) for the duration of the transfer.

The implementation is responsible for:

1. Filtering out blocks already present in this tier
2. Evicting blocks if capacity is needed
3. Allocating space in this tier
4. Submitting the async transfer (read from primary via block_ids)

Report completion via get_finished().

Parameters:

Name	Type	Description	Default
`job_metadata`	`JobMetadata`	Job metadata including job_id, keys, and block_ids identifying the primary-tier slots to read from.	required

Source code in vllm/v1/kv_offload/tiering/base.py

@abstractmethod
def submit_store(self, job_metadata: JobMetadata) -> None:
    """
    Submit an async job to store blocks from the primary tier to this
    secondary tier.

    This method must be lightweight and non-blocking: allocate metadata
    and submit the transfer, but do NOT perform the data copy on the
    calling thread.

    Preconditions (guaranteed by the framework):
      - ``job_metadata.block_ids`` are valid primary-tier slots, pinned
        (ref-counted) for the duration of the transfer.

    The implementation is responsible for:
      1. Filtering out blocks already present in this tier
      2. Evicting blocks if capacity is needed
      3. Allocating space in this tier
      4. Submitting the async transfer (read from primary via block_ids)

    Report completion via ``get_finished()``.

    Args:
        job_metadata: Job metadata including job_id, keys, and block_ids
                      identifying the primary-tier slots to read from.
    """
    pass
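
The four responsibilities listed for `submit_store()` can be sketched as follows. Everything here is a hypothetical stand-in chosen for illustration: `ToyStoreTier` is not a vLLM class, the primary tier is modeled as a list of `bytes` blocks, capacity is counted in blocks, eviction is oldest-first, and the method takes plain arguments rather than a `JobMetadata`. A real tier would also need to reject jobs larger than its capacity.

```python
import queue
from collections import OrderedDict
from concurrent.futures import ThreadPoolExecutor


class ToyStoreTier:
    """Hypothetical in-memory tier; illustrates the store flow only."""

    def __init__(self, primary_blocks: list, capacity: int) -> None:
        self._primary = primary_blocks  # stand-in for the primary KV view
        self._capacity = capacity       # maximum number of blocks held
        self._blocks = OrderedDict()    # key -> bytes, oldest first
        self._pending = set()           # keys reserved by in-flight stores
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._done = queue.SimpleQueue()

    def submit_store(self, job_id, keys, block_ids) -> None:
        # 1. Filter out blocks already present or already being stored.
        todo = [(k, b) for k, b in zip(keys, block_ids)
                if k not in self._blocks and k not in self._pending]
        # 2. Evict oldest blocks until the new ones fit.
        needed = len(self._pending) + len(todo)
        while self._blocks and len(self._blocks) + needed > self._capacity:
            self._blocks.popitem(last=False)
        # 3. Reserve space for the incoming blocks.
        self._pending.update(k for k, _ in todo)
        # 4. Submit the copy; no data is copied on the calling thread.
        self._pool.submit(self._copy, job_id, todo)

    def _copy(self, job_id, todo) -> None:
        try:
            staged = {k: bytes(self._primary[b]) for k, b in todo}
        except Exception:
            staged = None  # reported as a failed job
        self._done.put((job_id, [k for k, _ in todo], staged))

    def get_finished(self) -> list:
        results = []
        while True:
            try:
                job_id, keys, staged = self._done.get_nowait()
            except queue.Empty:
                return results
            self._pending.difference_update(keys)
            if staged is not None:
                self._blocks.update(staged)  # commit on the calling thread
            results.append((job_id, staged is not None))
```

The worker only stages copies; the index is updated when `get_finished()` runs on the calling thread, which keeps all bookkeeping single-threaded in this sketch.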

touch ¶

touch(
    keys: Collection[OffloadKey], req_context: ReqContext
)

Mark blocks as recently used for eviction policy.

Parameters:

Name	Type	Description	Default
`keys`	`Collection[OffloadKey]`	Offload keys to mark as recently used.	required
`req_context`	`ReqContext`	Per-request context.	required

Source code in vllm/v1/kv_offload/tiering/base.py

def touch(self, keys: Collection[OffloadKey], req_context: ReqContext):
    """
    Mark blocks as recently used for eviction policy.

    Args:
        keys: Offload keys to mark as recently used.
        req_context: Per-request context.
    """
    return
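
The base implementation of `touch()` is a no-op, so a tier with a recency-based eviction policy would override it. A minimal sketch of an LRU ordering built on `collections.OrderedDict` follows; `LruIndex`, `add()`, and `evict_one()` are hypothetical names, and the `req_context` argument is omitted for brevity.

```python
from collections import OrderedDict


class LruIndex:
    """Hypothetical recency tracker: least recently used key first."""

    def __init__(self) -> None:
        self._order = OrderedDict()

    def add(self, key) -> None:
        self._order[key] = None
        self._order.move_to_end(key)

    def touch(self, keys) -> None:
        # Move known keys to the most-recently-used end; ignore the rest.
        for key in keys:
            if key in self._order:
                self._order.move_to_end(key)

    def evict_one(self):
        # Remove and return the least recently used key.
        key, _ = self._order.popitem(last=False)
        return key
```

After adding `a`, `b`, `c` and touching `a`, the next eviction candidate is `b`.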
