Crate tritonserver_rs

§Perform easy and efficient ML model inference

This crate is designed to run machine learning models easily and efficiently on a wide range of architectures.
It builds on the Triton Inference Server (specifically the Triton C library) and provides a similar API with comparable advantages. Unlike a standalone Triton deployment, Tritonserver-rs lets you build the inference server locally, inside your own application, which offers significant performance benefits. See the benchmark for details.


§Usage

Run inference in three simple steps:

§Step 1. Prepare the model repository

Organize your model files in the following structure:

models/
├── yolov8/
│   ├── config.pbtxt
│   ├── 1/
│   │   └── model.onnx
│   ├── 2/
│   │   └── model.onnx
│   └── <other versions of yolov8>/
└── <other models>/

Rules:

  • All models must be stored in the same root directory (models/ in this example).
  • Each model resides in its own folder containing:
    • A config.pbtxt configuration file.
    • One or more subdirectories, each representing a version of the model and containing the model file (e.g., model.onnx).
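The rules above can be checked before the server starts. The following is a minimal sketch using only the standard library; it is not part of Tritonserver-rs, and the function names `check_model_dir` and `check_repository` are hypothetical. It assumes that version directories have purely numeric names (as in `1/` and `2/` above) and that a version directory must contain at least one file.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Returns human-readable problems found in a single model directory.
/// Hypothetical helper, not part of the Tritonserver-rs API.
fn check_model_dir(model_dir: &Path) -> io::Result<Vec<String>> {
    let mut problems = Vec::new();
    let name = model_dir.display();

    // Rule: each model folder contains a config.pbtxt file.
    if !model_dir.join("config.pbtxt").is_file() {
        problems.push(format!("{name}: missing config.pbtxt"));
    }

    // Rule: one or more version subdirectories, each holding a model file.
    // Assumption: version directory names consist of ASCII digits only.
    let mut versions = 0;
    for entry in fs::read_dir(model_dir)? {
        let path = entry?.path();
        let is_numeric = path
            .file_name()
            .and_then(|n| n.to_str())
            .map_or(false, |n| !n.is_empty() && n.chars().all(|c| c.is_ascii_digit()));
        if path.is_dir() && is_numeric {
            versions += 1;
            if fs::read_dir(&path)?.next().is_none() {
                problems.push(format!("{}: version directory is empty", path.display()));
            }
        }
    }
    if versions == 0 {
        problems.push(format!("{name}: no version subdirectory"));
    }
    Ok(problems)
}

/// Checks every model folder directly under the repository root.
fn check_repository(root: &Path) -> io::Result<Vec<String>> {
    let mut problems = Vec::new();
    for entry in fs::read_dir(root)? {
        let path = entry?.path();
        if path.is_dir() {
            problems.extend(check_model_dir(&path)?);
        }
    }
    Ok(problems)
}
```

Calling `check_repository(Path::new("models"))` returns an empty list when the layout follows the rules, and one message per violation otherwise. The check only inspects the directory structure; it does not parse config.pbtxt or validate the model files themselves.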

§Step 2. Write the code

Add Tritonserver-rs to your Cargo.toml:

[dependencies]
tritonserver-rs = "0.1"

Then write your application code:

use tritonserver_rs::{Buffer, options::Options, Server};
use std::time::Duration;

// Configure server options.
let mut opts = Options::new("models/")?;

opts.exit_timeout(Duration::from_secs(5))?
    .backend_directory("/opt/tritonserver/backends")?;

// Create the server.
let server = Server::new(opts).await?;

// Input data.
let image = image::open("/data/cats.jpg")?;
let image = image.as_flat_samples_u8();

// Create a request (specify the model name and version).
let mut request = server.create_request("yolov8", 2)?;

// Add input data and an allocator.
request
    .add_default_allocator()
    .add_input("IMAGE", Buffer::from(image))?;

// Run inference.
let fut = request.infer_async()?;

// Obtain results.
let response = fut.await?;

§Step 3. Deploy

Here is an example of how to deploy using docker-compose.yml:

my_app:
  image: {DEV_IMAGE}
  volumes:
    - ./Cargo.toml:/project/Cargo.toml
    - ./src:/project/src
    - ../models:/models
    - ../cats.jpg:/data/cats.jpg
  entrypoint: ["cargo", "run", "--manifest-path=/project/Cargo.toml"]

We recommend building {DEV_IMAGE} from Dockerfile.dev. For more details on suitable images and deployment instructions, see DEPLOY.md.


§More Information

For further details, see the resources in the GitHub repository.


§Advantages of the Crate

  • Versatility: Extensive configuration options for models and servers.
  • High performance: Optimized for maximum efficiency.
  • Broad backend support: Run PyTorch, ONNX, TensorFlow, TensorRT, OpenVINO, model pipelines, and custom backends out of the box.
  • Compatibility: Supports most GPUs and architectures.
  • Multi-model handling: Handle multiple models simultaneously.
  • Prometheus integration: Built-in support for monitoring.
  • CUDA-optimized: Directly handle model inputs and outputs on GPU memory.
  • Dynamic server management: Advanced runtime control features.
  • Rust-based: Enjoy the safety, speed, and concurrency benefits of Rust.

§Tritonserver C-lib API version

1.33 (minimum TRITON_CONTAINER_VERSION=23.07).

Re-exports§

pub use crate::error::Error;
pub use crate::error::ErrorCode;
pub use crate::macros::run_in_context;
pub use crate::macros::run_in_context_sync;
pub use crate::memory::Buffer;
pub use crate::memory::MemoryType;
pub use crate::request::Allocator;
pub use crate::request::Request;
pub use crate::response::Response;
pub use crate::server::Server;
pub use context::get_context;
pub use context::init_cuda;

Modules§

context
CUDA context for managing device execution.
error
Error types for Tritonserver-rs.
macros
Macros to run CUDA operations in context.
memory
Memory management utilities for model inference: memory allocation and assignment.
message
Metadata message serialization/deserialization.
metrics
Performance metrics collection and reporting.
options
Configuration options for the Tritonserver-rs server.
parameter
Model inference requests and server parameters.
request
Request builder and utilities for Triton server inference.
response
Response handling and parsing from Triton server.
server
Server initialization and lifecycle management.
trace
Tracing utilities for debugging and profiling.

Constants§

TRITONSERVER_API_VERSION_MAJOR
TRITONSERVER_API_VERSION_MINOR

Functions§

api_version
Get the TRITONSERVER API version supported by the Triton library. This value can be compared against the TRITONSERVER_API_VERSION_MAJOR and TRITONSERVER_API_VERSION_MINOR used to build the client to ensure that the Triton shared library is compatible with the client.
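
The comparison described above can be sketched as a small pure function. This is an illustration under stated assumptions, not the crate's own check: `is_compatible` is a hypothetical name, versions are passed as plain `(major, minor)` tuples because the exact return type of `api_version` is not shown here, and the rule used (equal major versions, library minor at least the build-time minor) is the usual convention for this kind of versioning.

```rust
/// Returns true when a library reporting `runtime` can serve a client that was
/// built against `built`. Both arguments are (major, minor) pairs.
///
/// Assumption: major versions must match exactly, and the library's minor
/// version must be greater than or equal to the minor version used at build time.
fn is_compatible(built: (u32, u32), runtime: (u32, u32)) -> bool {
    built.0 == runtime.0 && runtime.1 >= built.1
}
```

In an application, `built` would come from TRITONSERVER_API_VERSION_MAJOR and TRITONSERVER_API_VERSION_MINOR, and `runtime` from the value reported by `api_version`; a failed check is a reason to stop before creating the server.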