
Crate atomr_infer_runtime_tensorrt


§inference-runtime-tensorrt

NVIDIA TensorRT runner — wraps atomr-accel-tensorrt’s TrtRuntime / ExecutionContext / ExecutionBindings behind the ModelRunner trait. Doc §2.2, §10.3.

§Feature flags

  • tensorrt — pull in the upstream Phase 8 crate. Without this feature the runner compiles to a typed-error stub, so a consumer building with cargo build --features remote-only never pulls in cudarc / libnvinfer / nvonnxparser.
  • tensorrt-link — actually link libnvinfer.so at build time. Off by default: with the tensorrt feature alone, the runner compiles and unit tests pass without TensorRT installed; runtime calls return atomr_accel_tensorrt::error::TrtError::NotLinked, mapped to InferenceError::Internal.
  • tensorrt-onnx / tensorrt-int8 / tensorrt-fp8 / tensorrt-plugin — forwarded straight to the upstream crate so callers can compose ONNX import, INT8 PTQ, FP8 PTQ, and IPluginV3 trampolines with the same dep on this crate.
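The feature composition above can be sketched as a consumer-side manifest fragment. The crate path and version below are assumptions for illustration; the feature names are the ones documented in this section.

```toml
# Hypothetical consumer Cargo.toml fragment (path/version are placeholders).
[dependencies]
# Default build: typed-error stub only, no CUDA / TensorRT system deps.
atomr-infer-runtime-tensorrt = { version = "0.1", default-features = false }

# A GPU-enabled build would instead compose the upstream passthrough flags:
# atomr-infer-runtime-tensorrt = { version = "0.1", features = [
#     "tensorrt",        # compile the real runner (still no libnvinfer link)
#     "tensorrt-link",   # actually link libnvinfer.so at build time
#     "tensorrt-onnx",   # forward ONNX import support to the upstream crate
# ] }
```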

§What this runner does

  1. Reads the engine plan bytes from config.plan_path at construction time. Missing / unreadable plan ⇒ InferenceError::Internal.
  2. Lazily builds a TrtRuntime, deserialises the plan into a shared Arc<TrtEngine>, and constructs the per-request ExecutionContext inside ModelRunner::execute.
  3. Allocates a CUDA stream on the configured device_id so enqueueV3 can ride a real timeline. Operators wiring this runner alongside atomr-accel-cuda::DeviceActor should swap the lazy stream out via TensorRtRunner::with_stream (under the tensorrt feature) so the two actors share one execution timeline.
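The eager-plan-read / lazy-engine pattern in steps 1–2 can be sketched with stand-in types. Everything below except the TensorRtConfig / TensorRtRunner names is a hypothetical simplification: the real crate wraps atomr-accel-tensorrt's TrtRuntime and TrtEngine rather than the local stubs shown here.

```rust
use std::fs;
use std::path::PathBuf;
use std::sync::{Arc, OnceLock};

struct TensorRtConfig {
    plan_path: PathBuf,
    device_id: u32,
}

// Stand-in for atomr_accel_tensorrt::TrtEngine (hypothetical).
struct TrtEngine {
    plan: Vec<u8>,
}

#[derive(Debug)]
enum InferenceError {
    Internal(String),
}

struct TensorRtRunner {
    config: TensorRtConfig,
    plan_bytes: Vec<u8>,              // read eagerly at construction (step 1)
    engine: OnceLock<Arc<TrtEngine>>, // deserialised lazily, shared via Arc (step 2)
}

impl TensorRtRunner {
    /// Step 1: read the plan eagerly; a missing or unreadable
    /// plan surfaces as InferenceError::Internal.
    fn new(config: TensorRtConfig) -> Result<Self, InferenceError> {
        let plan_bytes = fs::read(&config.plan_path)
            .map_err(|e| InferenceError::Internal(format!("plan read failed: {e}")))?;
        Ok(Self { config, plan_bytes, engine: OnceLock::new() })
    }

    /// Step 2: build the engine once on first use and hand out
    /// cheap Arc clones to per-request execution contexts.
    fn engine(&self) -> Arc<TrtEngine> {
        self.engine
            .get_or_init(|| Arc::new(TrtEngine { plan: self.plan_bytes.clone() }))
            .clone()
    }
}
```

The OnceLock keeps deserialisation off the construction path while still guaranteeing a single shared engine across concurrent requests.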

§What this runner does not do

Tokenisation. The ExecuteBatch shape is a chat-style Vec<Message> + sampling params; TensorRT engines consume raw tensors. The runner therefore exposes a TensorRtRunner::enqueue method (under the tensorrt feature) for callers that have already produced device pointers via ExecutionBindings, and ModelRunner::execute returns a typed InferenceError::Internal pointing the caller at the tokeniser-specific path. A future revision can layer an LLM-aware adapter on top.
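The split described above — a typed error on the chat-shaped trait path, plus a tensor-level enqueue path — can be sketched as follows. The trait signature, Message shape, and ExecutionBindings contents here are hypothetical simplifications of the real crate's types.

```rust
#[derive(Debug)]
enum InferenceError {
    Internal(String),
}

// Chat-style input the trait accepts but this runner cannot lower to tensors.
struct Message {
    role: String,
    content: String,
}

// Stand-in for ExecutionBindings: device pointers the caller prepared.
struct ExecutionBindings {
    device_ptrs: Vec<usize>,
}

trait ModelRunner {
    fn execute(&self, batch: &[Message]) -> Result<Vec<String>, InferenceError>;
}

struct TensorRtRunner;

impl ModelRunner for TensorRtRunner {
    fn execute(&self, _batch: &[Message]) -> Result<Vec<String>, InferenceError> {
        // No tokeniser lives in this crate: chat messages cannot be lowered
        // to the raw tensors a TensorRT engine consumes.
        Err(InferenceError::Internal(
            "TensorRT runner takes raw tensors; use enqueue with ExecutionBindings".into(),
        ))
    }
}

impl TensorRtRunner {
    /// Tensor-level path for callers that already hold device pointers.
    fn enqueue(&self, bindings: &ExecutionBindings) -> Result<(), InferenceError> {
        if bindings.device_ptrs.is_empty() {
            return Err(InferenceError::Internal("no bindings supplied".into()));
        }
        Ok(()) // real implementation would drive enqueueV3 here
    }
}
```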

Structs§

TensorRtConfig
Engine-loading configuration.
TensorRtRunner
ModelRunner that drives an immutable TensorRT engine.

Enums§

TrtPrecision
Serializable mirror of atomr_accel_tensorrt::builder::Precision so configs can be parsed without pulling the upstream crate.
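The mirror-enum idea can be sketched as below. The variant names are assumptions, and the real type presumably derives serde traits for config parsing; this dependency-free sketch uses a hand-rolled FromStr to show the same decoupling from the upstream Precision type.

```rust
use std::str::FromStr;

/// Local mirror of the upstream builder precision (variant names assumed),
/// so config files can name a precision without pulling the TensorRT crate.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TrtPrecision {
    Fp32,
    Fp16,
    Int8,
    Fp8,
}

impl FromStr for TrtPrecision {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_ascii_lowercase().as_str() {
            "fp32" => Ok(Self::Fp32),
            "fp16" => Ok(Self::Fp16),
            "int8" => Ok(Self::Int8),
            "fp8" => Ok(Self::Fp8),
            other => Err(format!("unknown precision: {other}")),
        }
    }
}
```

A From<TrtPrecision> impl for the upstream Precision, gated behind the tensorrt feature, would complete the bridge.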