libinfer 0.0.5

Rust interface to TensorRT for high-performance GPU inference

This library provides a simple Rust interface to a TensorRT engine, using the cxx crate to bridge Rust and C++.

Overview

libinfer allows for seamless integration of TensorRT models into Rust applications with minimal overhead. The library handles the complex C++ interaction with TensorRT while exposing a simple, idiomatic Rust API.

Installation

To use this library, you'll need:

  • CUDA and TensorRT installed on your system
  • Environment variables set properly:
    • TENSORRT_LIBRARIES: Path to TensorRT libraries
    • CUDA_LIBRARIES: Path to CUDA libraries
    • CUDA_INCLUDE_DIRS: Path to CUDA include directories
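
For example, the build environment might be set up like this (the paths below are illustrative only; point them at your actual CUDA and TensorRT installation):

```shell
# Illustrative paths; adjust to your system's CUDA/TensorRT install locations.
export TENSORRT_LIBRARIES=/usr/lib/x86_64-linux-gnu
export CUDA_LIBRARIES=/usr/local/cuda/lib64
export CUDA_INCLUDE_DIRS=/usr/local/cuda/include
```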

Add to your Cargo.toml:

[dependencies]
libinfer = "0.0.5"

Usage

The goal of the API is to keep as much processing in Rust land as possible. Here is a sample usage:

let options = Options {
    path: "yolov8n.engine".into(),
    device_index: 0,
};
let mut engine = Engine::new(&options).unwrap();

// Get input dimensions of the engine as [Channels, Height, Width]
let dims = engine.get_input_dims();

// Construct a dummy input (uint8 or float32 depending on model)
let input_size = dims.iter().fold(1, |acc, &e| acc * e as usize);
let input = InputTensor {
    name: "input".to_string(),
    data: vec![0u8; input_size],
};

// Run inference
let output = engine.pin_mut().infer(&input).unwrap();

// Postprocess the output according to your model's output format
// ...
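
The input length in the example above is just the product of the engine's input dimensions. As a standalone sketch (the dims value is hypothetical, standing in for what get_input_dims() would return for a 640x640 RGB model such as yolov8n):

```rust
fn main() {
    // Hypothetical dims, as get_input_dims() might report for a 640x640 RGB model.
    let dims: [u32; 3] = [3, 640, 640];

    // Same fold as in the usage example: multiply all dimensions together
    // to get the total element count.
    let input_size = dims.iter().fold(1usize, |acc, &e| acc * e as usize);
    assert_eq!(input_size, 3 * 640 * 640);

    // For a UINT8 model, the element count is also the buffer length in bytes.
    let input = vec![0u8; input_size];
    println!("{}", input.len()); // prints 1228800
}
```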

This library is intended to be used with pre-built TensorRT engines created by the Python API or the trtexec CLI tool for the target device.
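
For instance, an engine can be built from an ONNX model with trtexec (model names here are placeholders; engines are specific to the GPU and TensorRT version they were built on, so build on the target device):

```shell
# Convert an ONNX model to a serialized TensorRT engine on the target device.
trtexec --onnx=model.onnx --saveEngine=model.engine
```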

Features

  • Support for both fixed and dynamic batch sizes
  • Automatic handling of different input data types (UINT8, FP32)
  • Direct access to model dimensions and parameters
  • Error handling via Rust's Result type
  • Logging integration with RUST_LOG environment variable

Examples

Check the examples/ directory for working examples:

  • basic.rs: Simple inference example
  • benchmark.rs: Performance benchmarking with various batch sizes
  • dynamic.rs: Working with dynamic batch sizes
  • functional_test.rs: Testing correctness of model outputs

Run an example with:

cargo run --example basic -- --path /path/to/model.engine

Example Requirements

  • You must provide your own TensorRT engine files (.engine)
  • For the functional_test example, you'll need input.bin and features.txt files
  • To create engine files, use NVIDIA's TensorRT tools such as:
    • TensorRT Python API
    • trtexec command-line tool
    • ONNX -> TensorRT conversion tools

See the documentation in each example file for specific requirements.

Synchronization Model

No cudaStreamSynchronize is needed between H2D copies and enqueueV3. This is safe for several reasons:

  1. Stream ordering — all H2D copies and enqueueV3 are submitted to the same CUDA stream, which guarantees in-order execution. Copies complete before kernels begin.
  2. Pageable host memory — input data comes from Rust Vec<u8> on the regular heap (not pinned memory). cudaMemcpyAsync with pageable memory blocks the CPU until the copy is staged, making a subsequent sync redundant.
  3. TensorRT auxiliary streams — TRT may use auxiliary streams internally during enqueueV3, but it inserts event synchronizations so all auxiliary work waits on the main stream at entry and the main stream waits on all auxiliary work at exit.
  4. CPU-side calls: setInputShape, allInputDimensionsSpecified, and setTensorAddress are pure CPU operations with no stream interaction.

A post-inference cudaStreamSynchronize is still required to ensure D2H output copies are complete before reading results. infer() handles this internally.

Current Limitations

  • The underlying engine code is not threadsafe (and the Rust binding does not implement Sync)
  • Engine instances are Send but not Sync
  • Input and output tensors are copied across the CPU-GPU boundary on every inference call; there is no zero-copy or device-pointer path yet
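
Since an engine is Send but not Sync, it can be moved to another thread or guarded by a mutex, but not shared by plain reference. A minimal sketch of the mutex approach, using a stand-in struct (the real Engine requires a GPU and an engine file, and is driven through pin_mut()):

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Stand-in for libinfer::Engine: movable across threads (Send),
// but not safely shareable by reference (not Sync).
struct Engine {
    device_index: u32,
}

fn main() {
    let engine = Engine { device_index: 0 };

    // Arc<Mutex<_>> supplies the synchronization the engine lacks,
    // serializing access so only one thread uses it at a time.
    let shared = Arc::new(Mutex::new(engine));

    let handles: Vec<_> = (0..4)
        .map(|_| {
            let shared = Arc::clone(&shared);
            thread::spawn(move || {
                let engine = shared.lock().unwrap();
                // A real call, e.g. engine.pin_mut().infer(&input), would go here.
                engine.device_index
            })
        })
        .collect();

    for handle in handles {
        assert_eq!(handle.join().unwrap(), 0);
    }
}
```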

Future Work

  • Allow passing device pointers and CUDA streams so callers can manage synchronization themselves
  • Async execution support

Credits

Much of the C++ code is based on the tensorrt-cpp-api repo.