§Engine Protocols
This module contains the public API protocols for the LLM Engine and AsyncEngine facades.
The core components are the CompletionRequest and StreamingCompletionResponse objects.
The StreamingCompletionResponse objects are the primary outputs of the LLM Engine; however,
some additional information is needed to propagate intermediate results for improved
observability. That metadata is transferred via the other arms of the StreamingResponse enum.
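To make that flow concrete, here is a minimal sketch of how such a stream might be shaped and consumed. The Step variant name and every field shown are assumptions made for illustration only; they are not the crate's actual definitions.

```rust
// Minimal sketch of the request/response flow described above.
// All names and payloads are illustrative assumptions, not the crate's API.
enum StreamingResponse {
    /// Optional first arm: the request was preprocessed and queued (Prologue metadata).
    Initialize(Prologue),
    /// Ordinary output arm carrying incremental completion data.
    Step(StreamingCompletionResponse),
    /// Optional final arm: the stream finished without error (Epilogue metadata).
    Finalize(Epilogue),
}

struct Prologue {
    queued_at_ms: u64, // hypothetical observability field
}

struct StreamingCompletionResponse {
    token_ids: Vec<u32>, // hypothetical incremental output
}

struct Epilogue {
    generated_tokens: usize, // hypothetical usage counter
}

/// Illustrative consumer: metadata arms feed observability, Step arms feed output.
fn consume(stream: impl IntoIterator<Item = StreamingResponse>) {
    for msg in stream {
        match msg {
            StreamingResponse::Initialize(p) => eprintln!("queued at {} ms", p.queued_at_ms),
            StreamingResponse::Step(s) => println!("new tokens: {:?}", s.token_ids),
            StreamingResponse::Finalize(e) => eprintln!("done, {} tokens generated", e.generated_tokens),
        }
    }
}

fn main() {
    consume(vec![
        StreamingResponse::Initialize(Prologue { queued_at_ms: 3 }),
        StreamingResponse::Step(StreamingCompletionResponse { token_ids: vec![42, 7] }),
        StreamingResponse::Finalize(Epilogue { generated_tokens: 2 }),
    ]);
}
```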
Structs§
- CalibrationResults
- ChatCompletionLogprobs
- ChatCompletionTokenLogprob
- ChatContext - ChatContext is a struct that contains the role and context of a chat message along with a flattened CompletionContext.
- ChatTurn - ChatTurn is a struct that contains the user and assistant messages in a chat.
- CompletionContext - Defines the prompt template and system prompt for a completion request. If the model does not support prompt templates, the system_prompt will be ignored.
- CompletionRequest - TensorRT LLM does not perform preprocessing or postprocessing. The input_ids / token_ids are expected to be preprocessed by the client. The client is responsible for constructing the model-specific prompt template and applying the tokenizer.
- CompletionRequestBuilder - Builder for CompletionRequest. (A usage sketch follows this list.)
- Delta
- Epilogue - This is the final message emitted by an Engine Response Stream when it finishes without error. In some cases, the engine may emit an error, which indicates the end of the stream. Another case in which a Finalize(Epilogue) will not be emitted is if the response handler has stalled and too many responses …
- LoadgenResults
- OutputOptions - Collection of options that control what information the inference engine returns in the response.
- PerformanceModel
- Prologue - This is the first message emitted by an Engine Response Stream. It indicates that the request has been preprocessed and queued for execution on the backend.
- SamplingOptions - Collection of options that control the sampling behavior of the inference engine.
- ScatterData
- SequencePositionData - At each SequencePosition we hold position-specific data.
- Stats
- StopConditions - TensorRT LLM server-side stop conditions. These options allow the server to evaluate the generated sequence and stop generation if the sequence meets a stop condition.
- StreamingCompletionResponse
- TopLogprob
- Trace
- Usage
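As a rough illustration of how a request might be assembled with the builder described above, the following self-contained sketch uses stand-in types; all field and method names are assumptions, not the crate's verified API.

```rust
// Illustrative only: stand-in types with invented fields, showing how a
// builder-style CompletionRequest might be assembled from pre-tokenized input.
#[derive(Debug, Default)]
struct SamplingOptions {
    temperature: Option<f32>,
    top_p: Option<f32>,
}

#[derive(Debug, Default)]
struct StopConditions {
    max_tokens: Option<u32>,
    stop_token_ids: Vec<u32>,
}

#[derive(Debug)]
struct CompletionRequest {
    token_ids: Vec<u32>, // client-side tokenization, as the docs above require
    sampling: SamplingOptions,
    stop: StopConditions,
}

#[derive(Default)]
struct CompletionRequestBuilder {
    token_ids: Vec<u32>,
    sampling: SamplingOptions,
    stop: StopConditions,
}

impl CompletionRequestBuilder {
    fn token_ids(mut self, ids: Vec<u32>) -> Self { self.token_ids = ids; self }
    fn temperature(mut self, t: f32) -> Self { self.sampling.temperature = Some(t); self }
    fn max_tokens(mut self, n: u32) -> Self { self.stop.max_tokens = Some(n); self }
    fn build(self) -> CompletionRequest {
        CompletionRequest { token_ids: self.token_ids, sampling: self.sampling, stop: self.stop }
    }
}

fn main() {
    // The prompt is already tokenized by the client; the engine does no preprocessing.
    let request = CompletionRequestBuilder::default()
        .token_ids(vec![1, 15043, 3186])
        .temperature(0.7)
        .max_tokens(128)
        .build();
    println!("{request:?}");
}
```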
Enums§
- CompletionRequestBuilderError - Error type for CompletionRequestBuilder
- FinishReason
- LogProbs
- Logits
- PromptType - LLM Inference Engines can accept a variety of input types. Not all Engines will support all input types. For example, the trtllm::AsyncEngine only supports PromptType::Tokens as an input type. The higher-level Backend class is a general wrapper around Engines that will enable many of the input options that require pre/postprocessing. (A sketch of this distinction follows this list.)
- StreamState
- StreamingResponse - StreamingResponse is the primary response object for the LLM Engine. The response stream can emit three different types of messages. The Initialize and Finalize messages are optional and primarily used over disaggregated transports to move state from the server to the client.
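The PromptType distinction above (pre-tokenized input accepted directly, text requiring preprocessing by a higher-level wrapper) can be sketched as follows; the variant set, the stand-in tokenizer, and the function names are illustrative assumptions only.

```rust
// Illustrative sketch of the PromptType idea: some engines accept only
// pre-tokenized input, while a higher-level Backend-style wrapper can also
// accept text and tokenize it first. Names and signatures are invented here.
enum PromptType {
    /// Raw text; requires a preprocessing step (tokenization) before the engine sees it.
    Completion(String),
    /// Pre-tokenized input; accepted directly by token-only engines.
    Tokens(Vec<u32>),
}

/// A stand-in tokenizer so the example is self-contained.
fn tokenize(text: &str) -> Vec<u32> {
    text.bytes().map(u32::from).collect()
}

/// A Backend-like wrapper: normalizes any PromptType to the tokens a
/// token-only engine (such as an AsyncEngine over TensorRT LLM) could consume.
fn to_engine_tokens(prompt: PromptType) -> Vec<u32> {
    match prompt {
        PromptType::Tokens(ids) => ids,
        PromptType::Completion(text) => tokenize(&text),
    }
}

fn main() {
    let from_text = to_engine_tokens(PromptType::Completion("hello".to_string()));
    let from_ids = to_engine_tokens(PromptType::Tokens(vec![1, 2, 3]));
    assert_eq!(from_text.len(), 5);
    assert_eq!(from_ids, vec![1, 2, 3]);
}
```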
Constants§
- FREQUENCY_PENALTY_RANGE - Frequency Penalty range for sampling.
- TEMPERATURE_RANGE - Temperature range for sampling.
- TOP_P_RANGE - Top P range for sampling.
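These constants suggest range-checked sampling parameters. Below is a minimal sketch of how such ranges could back validation, assuming a RangeInclusive representation and hypothetical bounds; the actual values are defined by the crate.

```rust
use std::ops::RangeInclusive;

// Hypothetical bounds for illustration only; the crate defines its own values.
const TEMPERATURE_RANGE: RangeInclusive<f32> = 0.0..=2.0;
const TOP_P_RANGE: RangeInclusive<f32> = 0.0..=1.0;
const FREQUENCY_PENALTY_RANGE: RangeInclusive<f32> = -2.0..=2.0;

/// Returns an error naming the offending parameter if it falls outside its range.
fn validate(name: &str, value: f32, range: &RangeInclusive<f32>) -> Result<(), String> {
    if range.contains(&value) {
        Ok(())
    } else {
        Err(format!("{name}={value} outside {:?}..={:?}", range.start(), range.end()))
    }
}

fn main() -> Result<(), String> {
    validate("temperature", 0.7, &TEMPERATURE_RANGE)?;
    validate("top_p", 0.95, &TOP_P_RANGE)?;
    validate("frequency_penalty", 0.0, &FREQUENCY_PENALTY_RANGE)?;
    Ok(())
}
```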
Traits§
- SamplingOptionsProvider - SamplingOptionsProvider is a trait that allows the caller to extract the sampling options from the object that implements it. This will mutate the object. (A sketch follows below.)
- StopConditionsProvider
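A minimal sketch of the provider pattern described above, assuming a take-style extraction that mutates the source object; the trait method and field names are illustrative, not the crate's actual signatures.

```rust
// Illustrative sketch: a provider trait that extracts sampling options and
// mutates the object it takes them from. All names are assumptions.
#[derive(Debug, Default)]
struct SamplingOptions {
    temperature: Option<f32>,
    top_p: Option<f32>,
}

trait SamplingOptionsProvider {
    /// Extract the sampling options, leaving a default value behind (hence &mut self).
    fn extract_sampling_options(&mut self) -> SamplingOptions;
}

struct CompletionRequest {
    sampling_options: SamplingOptions,
}

impl SamplingOptionsProvider for CompletionRequest {
    fn extract_sampling_options(&mut self) -> SamplingOptions {
        std::mem::take(&mut self.sampling_options)
    }
}

fn main() {
    let mut req = CompletionRequest {
        sampling_options: SamplingOptions { temperature: Some(0.7), top_p: Some(0.9) },
    };
    let opts = req.extract_sampling_options();
    println!("extracted: {opts:?}, remaining: {:?}", req.sampling_options);
}
```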