§Engine Protocols
This module contains the public API protocols for the LLM Engine and AsyncEngine facades.
The core components are the CompletionRequest and StreamingCompletionResponse objects.
The StreamingCompletionResponse objects are the primary outputs of the LLM Engine; however,
some additional information is needed to propagate intermediate results for improved
observability. That metadata is transferred via the other arms of the StreamingResponse enum.
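To make that flow concrete, here is a minimal sketch of how such a stream might be shaped and consumed. The Step variant name and every field shown are assumptions made for illustration only; they are not the crate's actual definitions.

```rust
// Minimal sketch of the request/response flow described above.
// All names and payloads are illustrative assumptions, not the crate's API.
enum StreamingResponse {
    /// Optional first arm: the request was preprocessed and queued (Prologue metadata).
    Initialize(Prologue),
    /// Ordinary output arm carrying incremental completion data.
    Step(StreamingCompletionResponse),
    /// Optional final arm: the stream finished without error (Epilogue metadata).
    Finalize(Epilogue),
}

struct Prologue {
    queued_at_ms: u64, // hypothetical observability field
}

struct StreamingCompletionResponse {
    token_ids: Vec<u32>, // hypothetical incremental output
}

struct Epilogue {
    generated_tokens: usize, // hypothetical usage counter
}

/// Illustrative consumer: metadata arms feed observability, Step arms feed output.
fn consume(stream: impl IntoIterator<Item = StreamingResponse>) {
    for msg in stream {
        match msg {
            StreamingResponse::Initialize(p) => eprintln!("queued at {} ms", p.queued_at_ms),
            StreamingResponse::Step(s) => println!("new tokens: {:?}", s.token_ids),
            StreamingResponse::Finalize(e) => eprintln!("done, {} tokens generated", e.generated_tokens),
        }
    }
}

fn main() {
    consume(vec![
        StreamingResponse::Initialize(Prologue { queued_at_ms: 3 }),
        StreamingResponse::Step(StreamingCompletionResponse { token_ids: vec![42, 7] }),
        StreamingResponse::Finalize(Epilogue { generated_tokens: 2 }),
    ]);
}
```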
Structs§
- CalibrationResults
- ChatCompletionLogprobs
- ChatCompletionTokenLogprob
- ChatContext - ChatContext is a struct that contains the role and context of a chat message along with a flattened CompletionContext.
- ChatTurn - ChatTurn is a struct that contains the user and assistant messages in a chat.
- CompletionContext - Defines the prompt template and system prompt for a completion request. If the model does not support prompt templates, the system_prompt will be ignored.
- CompletionRequest - TensorRT LLM does not perform preprocessing or postprocessing. The input_ids / token_ids are expected to be preprocessed by the client. The client is responsible for constructing the model-specific prompt template and applying the tokenizer.
- CompletionRequestBuilder - Builder for CompletionRequest. (A usage sketch follows this list.)
- Delta
- Epilogue - This is the final message emitted by an Engine Response Stream when it finishes without error. In some cases, the engine may emit an error, which indicates the end of the stream. Another case in which a Finalize(Epilogue) will not be emitted is if the response handler has stalled and too many responses …
- LoadgenResults
- OutputOptions - Collection of options that control what information the inference engine returns in the response.
- PerformanceModel
- Prologue - This is the first message emitted by an Engine Response Stream. It indicates that the request has been preprocessed and queued for execution on the backend.
- SamplingOptions - Collection of options that control the sampling behavior of the inference engine.
- ScatterData
- SequencePositionData - At each SequencePosition we hold position-specific data.
- Stats
- StopConditions - TensorRT LLM server-side stop conditions. These options allow the server to evaluate the generated sequence and stop generation if the sequence meets a stop condition.
- StreamingCompletionResponse
- TopLogprob
- Trace
- Usage
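As a rough illustration of how a request might be assembled with the builder described above, the following self-contained sketch uses stand-in types; all field and method names are assumptions, not the crate's verified API.

```rust
// Illustrative only: stand-in types with invented fields, showing how a
// builder-style CompletionRequest might be assembled from pre-tokenized input.
#[derive(Debug, Default)]
struct SamplingOptions {
    temperature: Option<f32>,
    top_p: Option<f32>,
}

#[derive(Debug, Default)]
struct StopConditions {
    max_tokens: Option<u32>,
    stop_token_ids: Vec<u32>,
}

#[derive(Debug)]
struct CompletionRequest {
    token_ids: Vec<u32>, // client-side tokenization, as the docs above require
    sampling: SamplingOptions,
    stop: StopConditions,
}

#[derive(Default)]
struct CompletionRequestBuilder {
    token_ids: Vec<u32>,
    sampling: SamplingOptions,
    stop: StopConditions,
}

impl CompletionRequestBuilder {
    fn token_ids(mut self, ids: Vec<u32>) -> Self { self.token_ids = ids; self }
    fn temperature(mut self, t: f32) -> Self { self.sampling.temperature = Some(t); self }
    fn max_tokens(mut self, n: u32) -> Self { self.stop.max_tokens = Some(n); self }
    fn build(self) -> CompletionRequest {
        CompletionRequest { token_ids: self.token_ids, sampling: self.sampling, stop: self.stop }
    }
}

fn main() {
    // The prompt is already tokenized by the client; the engine does no preprocessing.
    let request = CompletionRequestBuilder::default()
        .token_ids(vec![1, 15043, 3186])
        .temperature(0.7)
        .max_tokens(128)
        .build();
    println!("{request:?}");
}
```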
Enums§
- CompletionRequestBuilderError - Error type for CompletionRequestBuilder
- FinishReason
- LogProbs
- Logits
- PromptType - LLM Inference Engines can accept a variety of input types. Not all Engines will support all input types. For example, the trtllm::AsyncEngine only supports PromptType::Tokens as an input type. The higher-level Backend class is a general wrapper around Engines that will enable many of the input options that require pre/postprocessing. (A sketch of this distinction follows this list.)
- StreamState
- StreamingResponse - StreamingResponse is the primary response object for the LLM Engine. The response stream can emit three different types of messages. The Initialize and Finalize messages are optional and primarily used over disaggregated transports to move state from the server to the client.
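The PromptType distinction above (pre-tokenized input accepted directly, text requiring preprocessing by a higher-level wrapper) can be sketched as follows; the variant set, the stand-in tokenizer, and the function names are illustrative assumptions only.

```rust
// Illustrative sketch of the PromptType idea: some engines accept only
// pre-tokenized input, while a higher-level Backend-style wrapper can also
// accept text and tokenize it first. Names and signatures are invented here.
enum PromptType {
    /// Raw text; requires a preprocessing step (tokenization) before the engine sees it.
    Completion(String),
    /// Pre-tokenized input; accepted directly by token-only engines.
    Tokens(Vec<u32>),
}

/// A stand-in tokenizer so the example is self-contained.
fn tokenize(text: &str) -> Vec<u32> {
    text.bytes().map(u32::from).collect()
}

/// A Backend-like wrapper: normalizes any PromptType to the tokens a
/// token-only engine (such as an AsyncEngine over TensorRT LLM) could consume.
fn to_engine_tokens(prompt: PromptType) -> Vec<u32> {
    match prompt {
        PromptType::Tokens(ids) => ids,
        PromptType::Completion(text) => tokenize(&text),
    }
}

fn main() {
    let from_text = to_engine_tokens(PromptType::Completion("hello".to_string()));
    let from_ids = to_engine_tokens(PromptType::Tokens(vec![1, 2, 3]));
    assert_eq!(from_text.len(), 5);
    assert_eq!(from_ids, vec![1, 2, 3]);
}
```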
Constants§
- FREQUENCY_PENALTY_RANGE - Frequency Penalty range for sampling.
- TEMPERATURE_RANGE - Temperature range for sampling.
- TOP_P_RANGE - Top P range for sampling.
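These constants suggest range-checked sampling parameters. Below is a minimal sketch of how such ranges could back validation, assuming a RangeInclusive representation and hypothetical bounds; the actual values are defined by the crate.

```rust
use std::ops::RangeInclusive;

// Hypothetical bounds for illustration only; the crate defines its own values.
const TEMPERATURE_RANGE: RangeInclusive<f32> = 0.0..=2.0;
const TOP_P_RANGE: RangeInclusive<f32> = 0.0..=1.0;
const FREQUENCY_PENALTY_RANGE: RangeInclusive<f32> = -2.0..=2.0;

/// Returns an error naming the offending parameter if it falls outside its range.
fn validate(name: &str, value: f32, range: &RangeInclusive<f32>) -> Result<(), String> {
    if range.contains(&value) {
        Ok(())
    } else {
        Err(format!("{name}={value} outside {:?}..={:?}", range.start(), range.end()))
    }
}

fn main() -> Result<(), String> {
    validate("temperature", 0.7, &TEMPERATURE_RANGE)?;
    validate("top_p", 0.95, &TOP_P_RANGE)?;
    validate("frequency_penalty", 0.0, &FREQUENCY_PENALTY_RANGE)?;
    Ok(())
}
```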
Traits§
- SamplingOptionsProvider - SamplingOptionsProvider is a trait that allows the caller to extract the sampling options from the object that implements it. This will mutate the object. (A sketch follows below.)
- StopConditionsProvider
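A minimal sketch of the provider pattern described above, assuming a take-style extraction that mutates the source object; the trait method and field names are illustrative, not the crate's actual signatures.

```rust
// Illustrative sketch: a provider trait that extracts sampling options and
// mutates the object it takes them from. All names are assumptions.
#[derive(Debug, Default)]
struct SamplingOptions {
    temperature: Option<f32>,
    top_p: Option<f32>,
}

trait SamplingOptionsProvider {
    /// Extract the sampling options, leaving a default value behind (hence &mut self).
    fn extract_sampling_options(&mut self) -> SamplingOptions;
}

struct CompletionRequest {
    sampling_options: SamplingOptions,
}

impl SamplingOptionsProvider for CompletionRequest {
    fn extract_sampling_options(&mut self) -> SamplingOptions {
        std::mem::take(&mut self.sampling_options)
    }
}

fn main() {
    let mut req = CompletionRequest {
        sampling_options: SamplingOptions { temperature: Some(0.7), top_p: Some(0.9) },
    };
    let opts = req.extract_sampling_options();
    println!("extracted: {opts:?}, remaining: {:?}", req.sampling_options);
}
```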