usls 0.2.0-alpha.3

usls is a cross-platform Rust library powered by ONNX Runtime for efficient inference of SOTA vision and vision-language models (typically under 1B parameters).

🌟 Highlights

  • ⚡ High Performance: Multi-threading, SIMD, and CUDA-accelerated processing
  • ✨ Cross-Platform: Linux, macOS, Windows with ONNX Runtime execution providers (CUDA, TensorRT, CoreML, OpenVINO, DirectML, etc.)
  • 🎯 Precision Support: FP32, FP16, INT8, UINT8, Q4, Q4F16, BNB4, and more
  • 🛠️ Full-Stack Suite: DataLoader, Annotator, and Viewer for complete workflows
  • 🏗️ Unified API: A single Model trait with run()/forward()/encode_images()/encode_texts(), all returning the unified Y output
  • 📥 Auto-Management: Automatic model download (HuggingFace/GitHub), caching and path resolution
  • 📦 Multiple Inputs: Images, directories, videos, webcams, streams, and combinations thereof
  • 🌱 Model Ecosystem: 50+ SOTA vision and VLM models
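In code, the unified API means every model is driven the same way. The following is an illustrative sketch of that workflow only; the identifiers used here (YOLO::new, Options::yolo_detect, DataLoader::try_read_n, Annotator::annotate) are assumptions, so check the API reference for the exact constructors and signatures:

```rust
// Illustrative sketch only: type and method names are assumptions,
// not guaranteed to match the current usls API.
use usls::{models::YOLO, Annotator, DataLoader, Options};

fn main() -> anyhow::Result<()> {
    // Model files are downloaded and cached automatically (HuggingFace/GitHub).
    let mut model = YOLO::new(Options::yolo_detect().with_model_file("yolo26n.onnx"))?;

    // Inputs can be images, directories, videos, webcams, or streams.
    let xs = DataLoader::try_read_n(&["./assets/bus.jpg"])?;

    // Every model returns the unified `Y` output.
    let ys = model.forward(&xs)?;

    // Annotate and save the detections.
    let annotator = Annotator::default();
    for (x, y) in xs.iter().zip(ys.iter()) {
        annotator.annotate(x, y)?.save("./result.jpg")?;
    }
    Ok(())
}
```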

🚀 Quick Start

Run the YOLO-Series demo to explore models across different tasks, precisions, and execution providers:

  • Tasks: detect, segment, pose, classify, obb
  • Versions: v5, v6, v7, v8, v9, v10, 11, 12, v13, 26
  • Scales: n, s, m, l, x
  • Precision: fp32, fp16, q8, int8, q4, q4f16, bnb4, and more
  • Execution Providers: CPU, CUDA, TensorRT, TensorRT-RTX, CoreML, OpenVINO, and more
cargo run -r --example yolo -- --task detect --ver 26 --scale n --dtype fp16
cargo run -r -F cuda --example yolo -- --task segment --ver 11 --scale m --device cuda:0 --processor-device cuda:0
cargo run -r -F tensorrt-full --example yolo -- --device tensorrt:0 --processor-device cuda:0
cargo run -r -F nvrtx-full --example yolo -- --device nvrtx:0 --processor-device cuda:0
cargo run -r -F coreml --example yolo -- --device coreml
cargo run -r -F openvino -F ort-load-dynamic --example yolo -- --device openvino:CPU

Environment: NVIDIA RTX 3060Ti (TensorRT-10.11.0.33, CUDA 12.8, TensorRT-RTX-1.3.0.35) / Intel i5-12400F

Setup: YOLO26 Detection, COCO2017-val (5,000 images), 640x640, Conf thresholds: [0.35, 0.3, ..]

Results are for rough reference only.

Scale  EP            ImageProcessor  DType  Batch  Preprocess  Inference  Postprocess  Total
n      TensorRT      CUDA            FP16   1      ~233µs      ~1.3ms     ~14µs        ~1.55ms
n      TensorRT-RTX  CUDA            FP32   1      ~233µs      ~2.0ms     ~10µs        ~2.24ms
n      TensorRT-RTX  CUDA            FP16   1
n      CUDA          CUDA            FP32   1      ~233µs      ~5.0ms     ~17µs        ~5.25ms
n      CUDA          CUDA            FP16   1      ~233µs      ~3.6ms     ~17µs        ~3.85ms
n      CUDA          CPU             FP32   1      ~800µs      ~6.5ms     ~14µs        ~7.31ms
n      CUDA          CPU             FP16   1      ~800µs      ~5.0ms     ~14µs        ~5.81ms
n      CPU           CPU             FP32   1      ~970µs      ~20.5ms    ~14µs        ~21.48ms
n      CPU           CPU             FP16   1      ~970µs      ~25.0ms    ~14µs        ~25.98ms
n      TensorRT      CUDA            FP16   8      ~1.2ms      ~6.0ms     ~55µs        ~7.26ms
n      TensorRT      CPU             FP16   8      ~18.0ms     ~25.5ms    ~55µs        ~43.56ms
m      TensorRT      CUDA            FP16   1      ~233µs      ~3.6ms     ~14µs        ~3.85ms
m      TensorRT      CUDA            Int8   1      ~233µs      ~2.6ms     ~14µs        ~2.84ms
m      CUDA          CUDA            FP32   1      ~233µs      ~16.1ms    ~17µs        ~16.35ms
m      CUDA          CUDA            FP16   1      ~233µs      ~8.8ms     ~17µs        ~9.05ms
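The Total column is roughly the sum of the three per-stage timings (preprocess and postprocess are reported in µs, inference in ms). A quick sanity check of the first row (n / TensorRT / FP16 / batch 1), with the helper name chosen here for illustration:

```rust
// Total latency = preprocess + inference + postprocess, normalized to ms.
fn total_ms(preprocess_us: f64, inference_ms: f64, postprocess_us: f64) -> f64 {
    preprocess_us / 1000.0 + inference_ms + postprocess_us / 1000.0
}

fn main() {
    // First row of the table: ~233µs + ~1.3ms + ~14µs.
    let t = total_ms(233.0, 1.3, 14.0);
    println!("total ≈ {t:.2}ms"); // ≈1.55ms, matching the table
}
```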

What's Next?

  • 📖 Online Documentation
  • 📚 API Reference
  • 🚀 Examples

🤝 Contributing

This is a personal project maintained in spare time, so progress on performance optimization and new model support may vary.

PRs for model optimization are especially welcome: if you have expertise in a specific model and can improve its interface or post-processing, your contribution would be invaluable. Feel free to open an issue or submit a pull request with suggestions, bug reports, or feature requests.

🙏 Acknowledgments

Thanks to all the open-source libraries and their maintainers that make this project possible. See Cargo.toml for a complete list of dependencies.

📜 License

This project is licensed under the terms described in the LICENSE file.