vllm.v1.outputs
AsyncModelRunnerOutput
Bases: ABC
Source code in vllm/v1/outputs.py
get_output abstractmethod
Get the ModelRunnerOutput for this async output.
This is a blocking call that waits until the results are ready, which might involve copying device tensors to the host. This method should only be called once per AsyncModelRunnerOutput.
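The blocking, call-once contract described above can be sketched with the standard library alone. This is only an illustration: AsyncOutputSketch, FutureBackedOutput, and run_step are hypothetical names, a Future stands in for the pending device-to-host copy, and raising on a second call is a choice made for this sketch rather than documented behavior of vLLM.

```python
from abc import ABC, abstractmethod
from concurrent.futures import Future, ThreadPoolExecutor


class AsyncOutputSketch(ABC):
    """Hypothetical stand-in for an async output handle."""

    @abstractmethod
    def get_output(self) -> dict:
        """Block until the result is ready, then return it."""


class FutureBackedOutput(AsyncOutputSketch):
    """Wraps a Future that plays the role of the pending copy."""

    def __init__(self, future: Future) -> None:
        self._future = future
        self._consumed = False

    def get_output(self) -> dict:
        # Sketch-only choice: make the "call once" contract explicit.
        if self._consumed:
            raise RuntimeError("get_output() was already called")
        self._consumed = True
        return self._future.result()  # blocks until the work finishes


def run_step() -> dict:
    # Hypothetical payload; the real ModelRunnerOutput has its own fields.
    return {"req_ids": ["req-0"], "sampled_token_ids": [[42]]}


with ThreadPoolExecutor(max_workers=1) as pool:
    handle = FutureBackedOutput(pool.submit(run_step))
    result = handle.get_output()
```

The caller holds the handle while other work proceeds and pays the wait only at the point where the result is actually needed.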
LogprobsTensors
Bases: NamedTuple
empty_cpu staticmethod
empty_cpu(num_positions: int, num_tokens_per_position: int) -> LogprobsTensors
Create empty LogprobsTensors on CPU.
filter
filter(mask: Tensor) -> LogprobsTensors
Filter the logprobs tensors with the given bool mask.
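As a rough illustration of row-wise filtering with a bool mask, the sketch below uses plain Python lists in place of tensors. The class name LogprobsSketch and its field names are hypothetical and not taken from vllm/v1/outputs.py; the empty constructor here zero-fills its rows, which may differ from what the real empty_cpu does.

```python
from typing import NamedTuple


class LogprobsSketch(NamedTuple):
    # Hypothetical fields; one row per position in each.
    token_ids: list[list[int]]
    logprobs: list[list[float]]

    @staticmethod
    def empty(num_positions: int, num_tokens_per_position: int) -> "LogprobsSketch":
        # Placeholder rows of the requested shape (zero-filled in this sketch).
        return LogprobsSketch(
            token_ids=[[0] * num_tokens_per_position for _ in range(num_positions)],
            logprobs=[[0.0] * num_tokens_per_position for _ in range(num_positions)],
        )

    def filter(self, mask: list[bool]) -> "LogprobsSketch":
        # Keep row i of every field only where mask[i] is True.
        return LogprobsSketch(
            token_ids=[row for row, keep in zip(self.token_ids, mask) if keep],
            logprobs=[row for row, keep in zip(self.logprobs, mask) if keep],
        )


batch = LogprobsSketch(
    token_ids=[[5, 9], [7, 3], [1, 4]],
    logprobs=[[-0.1, -2.3], [-0.5, -1.2], [-0.9, -0.7]],
)
kept = batch.filter([True, False, True])  # drops the middle position
```

The same mask is applied to every field, so the rows of the filtered tuple stay aligned with one another.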
RoutedExpertsLists
Bases: NamedTuple
CPU-side routed experts, in the form consumed by RoutedExpertsManager.store_batch.
Batched per scheduler step: the leading dimension is the total number of tokens scheduled across all requests in this step (total_num_scheduled_tokens), not the token count of any single request. slot_mapping[i] tells the scheduler which physical KV-cache slot row i of routing_data belongs to.
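The relationship between routing_data and slot_mapping can be sketched as follows. Plain lists replace the real backing arrays, the per-row layout of routing_data (a short list of expert ids) is an assumption, and this store_batch function is a hypothetical consumer, not the RoutedExpertsManager method.

```python
from typing import NamedTuple


class RoutedExpertsListsSketch(NamedTuple):
    routing_data: list[list[int]]  # row i: expert ids for scheduled token i (assumed layout)
    slot_mapping: list[int]        # slot_mapping[i]: KV-cache slot that row i belongs to


def store_batch(store: dict[int, list[int]], batch: RoutedExpertsListsSketch) -> None:
    """Hypothetical consumer: file each row under its KV-cache slot."""
    if len(batch.routing_data) != len(batch.slot_mapping):
        raise ValueError("routing_data and slot_mapping must have the same length")
    for row, slot in zip(batch.routing_data, batch.slot_mapping):
        store[slot] = row


# One scheduler step with three scheduled tokens in total:
# two from one request and one from another, flattened into one batch.
batch = RoutedExpertsListsSketch(
    routing_data=[[0, 3], [1, 2], [3, 0]],
    slot_mapping=[16, 17, 40],
)
store: dict[int, list[int]] = {}
store_batch(store, batch)
```

Because the batch is flattened across requests, the slot index is the only thing that ties a row back to a position in a particular request's cache.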
RoutedExpertsTensors
Bases: NamedTuple
Device-side snapshot of routed experts data, pending an asynchronous device-to-host (D2H) copy.
Produced by GPUModelRunner at the end of each async-scheduled step. The copy stream waits on the default stream, then issues a non-blocking D2H copy via to_cpu_nonblocking into a pinned CPU buffer; AsyncGPUModelRunnerOutput.get_output synchronizes the copy before the scheduler reads it.
Sliced to total_num_scheduled_tokens (step-level, across all requests, NOT per-request). Both routing_data and slot_mapping must be private clones when sourced from shared capturer or prepare-input buffers, so that the next forward pass or _prepare_inputs call on the default stream does not race with a D2H copy still pending on the copy stream.
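The reason for taking private clones can be illustrated with a host-only analogy. The sketch below uses a bytearray as the shared, reused buffer and a memoryview slice as a stand-in for a tensor slice that aliases it; nothing here touches a GPU or a stream, and all names are hypothetical.

```python
# Shared buffer that is overwritten in place on every step
# (stand-in for a capturer or prepare-input buffer).
shared = bytearray([1, 2, 3, 4])
num_scheduled = 2

# Slicing through a memoryview aliases the buffer: no data is copied.
aliased = memoryview(shared)[:num_scheduled]

# A private snapshot owns its own bytes (analogous to cloning the slice).
snapshot = bytes(shared[:num_scheduled])

# The "next step" reuses the shared buffer before the consumer has read
# the previous step's data.
shared[0] = 9

aliased_seen = aliased.tolist()   # reflects the overwrite
snapshot_seen = list(snapshot)    # still holds the original step's data
```

In the real setting the overwrite and the pending copy run on different streams, so an aliased slice could be read while the next step is already writing to it; the private clone removes that dependency.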
to_cpu_nonblocking
to_cpu_nonblocking() -> RoutedExpertsTensors
Issue non-blocking D2H on the current stream.
NOTE: non_blocking=True only delivers true overlap when the CPU target is pinned. The current fallback here allocates a new pageable CPU tensor per call, which silently degrades to a synchronous copy. This is acceptable because the synchronization happens on the dedicated copy stream, not the default stream.
tolists
tolists() -> RoutedExpertsLists
Convert to the numpy-backed form consumed by the scheduler.
.cpu() is a no-op when the tensor is already on CPU, so this is cheap in the post-D2H case. For raw device tensors it blocks synchronously, a path that is only reached in tests.
make_empty_encoder_model_runner_output
Create a ModelRunnerOutput stub that contains the correct per-request bookkeeping but no generated data yet.
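A stub of this kind might look like the sketch below. The function's real signature is not shown on this page, so RunnerOutputSketch, its fields, and make_empty_output_sketch are all hypothetical; the point is only that the request bookkeeping is filled in while the per-request results stay empty.

```python
from dataclasses import dataclass, field


@dataclass
class RunnerOutputSketch:
    # Hypothetical bookkeeping fields.
    req_ids: list[str]
    req_id_to_index: dict[str, int]
    # Hypothetical result field; left empty in a stub.
    sampled_token_ids: list[list[int]] = field(default_factory=list)


def make_empty_output_sketch(req_ids: list[str]) -> RunnerOutputSketch:
    """Build an output with correct bookkeeping and no generated data."""
    return RunnerOutputSketch(
        req_ids=list(req_ids),
        req_id_to_index={req_id: i for i, req_id in enumerate(req_ids)},
        sampled_token_ids=[[] for _ in req_ids],  # one empty slot per request
    )


stub = make_empty_output_sketch(["req-a", "req-b"])
```

A consumer can index into such a stub by request id exactly as it would for a full output, and simply finds nothing generated yet.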