usls is a cross-platform Rust library powered by ONNX Runtime for efficient inference of SOTA vision and vision-language models (typically under 1B parameters).
## 🌟 Highlights
- ⚡ High Performance: Multi-threading, SIMD, and CUDA-accelerated processing
- ✨ Cross-Platform: Linux, macOS, Windows with ONNX Runtime execution providers (CUDA, TensorRT, CoreML, OpenVINO, DirectML, etc.)
- 🎯 Precision Support: FP32, FP16, INT8, UINT8, Q4, Q4F16, BNB4, and more
- 🛠️ Full-Stack Suite: `DataLoader`, `Annotator`, and `Viewer` for complete workflows
- 🏗️ Unified API: Single `Model` trait inference with `run()` / `forward()` / `encode_images()` / `encode_texts()` and a unified `Y` output
- 📥 Auto-Management: Automatic model download (HuggingFace/GitHub), caching, and path resolution
- 📦 Multiple Inputs: Image, directory, video, webcam, stream and combinations
- 🌱 Model Ecosystem: 50+ SOTA vision and VLM models
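To make the "Unified API" bullet concrete, here is a minimal sketch of the pattern it describes: one trait with a `run()` method and one output type shared by every task. This is an illustration of the design pattern only, not the actual usls trait or `Y` type; all names below are simplified stand-ins.

```rust
// Illustrative sketch of the "single trait + unified output" pattern.
// NOT the real usls API; types and signatures are simplified stand-ins.

/// A unified output type: every model returns the same enum,
/// so downstream code handles all tasks uniformly.
#[derive(Debug, PartialEq)]
#[allow(dead_code)]
enum Y {
    Boxes(Vec<(f32, f32, f32, f32)>), // detection: (x, y, w, h)
    Labels(Vec<String>),              // classification
}

/// One trait that all models implement.
trait Model {
    fn run(&self, input: &[u8]) -> Y;
}

/// A dummy model standing in for a real classifier.
struct DummyClassifier;

impl Model for DummyClassifier {
    fn run(&self, _input: &[u8]) -> Y {
        Y::Labels(vec!["cat".to_string()])
    }
}

fn main() {
    let model = DummyClassifier;
    let y = model.run(&[0u8; 4]);
    println!("{:?}", y); // all tasks come back as the same Y type
}
```

Because every model funnels into the same output type, the same annotation and post-processing code can consume any task's results.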
## 🚀 Quick Start
Run the YOLO-series demo to explore models across different tasks, precisions, and execution providers:
- Tasks: `detect`, `segment`, `pose`, `classify`, `obb`
- Versions: `v5`, `v6`, `v7`, `v8`, `v9`, `v10`, `11`, `12`, `v13`, `26`
- Scales: `n`, `s`, `m`, `l`, `x`
- Precision: `fp32`, `fp16`, `q8`, `int8`, `q4`, `q4f16`, `bnb4`, and more
- Execution Providers: `CPU`, `CUDA`, `TensorRT`, `TensorRT-RTX`, `CoreML`, `OpenVINO`, and more
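Among the precisions listed above, `int8`/`q8` work by mapping float weights and activations onto 8-bit integers with a scale factor. As background for those options, here is a hedged sketch of symmetric int8 affine quantization in plain Rust; this is not usls code (usls delegates quantized execution to ONNX Runtime's kernels), just an illustration of the principle.

```rust
// Background sketch: symmetric fp32 -> int8 quantization, the idea behind
// the `int8`/`q8` precision options. Not usls code; ONNX Runtime performs
// the actual quantized inference.

/// Quantize an f32 slice to i8 with a single symmetric scale.
fn quantize(xs: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = xs.iter().fold(0f32, |m, &x| m.max(x.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let qs = xs
        .iter()
        .map(|&x| (x / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (qs, scale)
}

/// Map the i8 values back to f32; the round trip loses at most ~`scale`.
fn dequantize(qs: &[i8], scale: f32) -> Vec<f32> {
    qs.iter().map(|&q| q as f32 * scale).collect()
}

fn main() {
    let xs = [0.5f32, -1.0, 0.25, 1.0];
    let (qs, scale) = quantize(&xs);
    let back = dequantize(&qs, scale);
    println!("{:?} (scale {scale})", back);
}
```

The trade-off visible here is the same one behind the benchmark table below: 8-bit tensors quarter the memory traffic of fp32 at the cost of a bounded rounding error per value.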
- Environment: NVIDIA RTX 3060 Ti (TensorRT 10.11.0.33, CUDA 12.8, TensorRT-RTX 1.3.0.35) / Intel i5-12400F
- Setup: YOLO26 detection, COCO2017-val (5,000 images), 640×640, confidence thresholds: [0.35, 0.3, ..]

Results are for rough reference only.
| Scale | EP | ImageProcessor | DType | Batch | Preprocess | Inference | Postprocess | Total |
|---|---|---|---|---|---|---|---|---|
| n | TensorRT | CUDA | FP16 | 1 | ~233µs | ~1.3ms | ~14µs | ~1.55ms |
| n | TensorRT-RTX | CUDA | FP32 | 1 | ~233µs | ~2.0ms | ~10µs | ~2.24ms |
| n | TensorRT-RTX | CUDA | FP16 | 1 | ❓ | ❓ | ❓ | ❓ |
| n | CUDA | CUDA | FP32 | 1 | ~233µs | ~5.0ms | ~17µs | ~5.25ms |
| n | CUDA | CUDA | FP16 | 1 | ~233µs | ~3.6ms | ~17µs | ~3.85ms |
| n | CUDA | CPU | FP32 | 1 | ~800µs | ~6.5ms | ~14µs | ~7.31ms |
| n | CUDA | CPU | FP16 | 1 | ~800µs | ~5.0ms | ~14µs | ~5.81ms |
| n | CPU | CPU | FP32 | 1 | ~970µs | ~20.5ms | ~14µs | ~21.48ms |
| n | CPU | CPU | FP16 | 1 | ~970µs | ~25.0ms | ~14µs | ~25.98ms |
| n | TensorRT | CUDA | FP16 | 8 | ~1.2ms | ~6.0ms | ~55µs | ~7.26ms |
| n | TensorRT | CPU | FP16 | 8 | ~18.0ms | ~25.5ms | ~55µs | ~43.56ms |
| m | TensorRT | CUDA | FP16 | 1 | ~233µs | ~3.6ms | ~14µs | ~3.85ms |
| m | TensorRT | CUDA | Int8 | 1 | ~233µs | ~2.6ms | ~14µs | ~2.84ms |
| m | CUDA | CUDA | FP32 | 1 | ~233µs | ~16.1ms | ~17µs | ~16.35ms |
| m | CUDA | CUDA | FP16 | 1 | ~233µs | ~8.8ms | ~17µs | ~9.05ms |
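The Total column is roughly the sum of the three stage columns (any small residual is per-frame overhead not broken out in the table). A quick check with `std::time::Duration`, using the first row (n / TensorRT / FP16 / batch 1) as the example:

```rust
use std::time::Duration;

// Sanity check on the table above: Total ~= Preprocess + Inference + Postprocess.

fn total(stages: &[Duration]) -> Duration {
    stages.iter().sum()
}

fn main() {
    // The n / TensorRT / FP16 / batch-1 row of the table.
    let stages = [
        Duration::from_micros(233),  // preprocess
        Duration::from_micros(1300), // inference
        Duration::from_micros(14),   // postprocess
    ];
    println!("{:?}", total(&stages)); // 1.547ms, close to the ~1.55ms reported
}
```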
## What's Next?
- 📖 Online Documentation
- 📚 API Reference
- 🚀 Examples
## 🤝 Contributing
This is a personal project maintained in spare time, so progress on performance optimization and new model support may vary.
PRs for model optimization are highly welcome! If you have expertise in specific models and can help improve their interfaces or post-processing, your contributions would be invaluable. Feel free to open an issue or submit a pull request with suggestions, bug reports, or feature requests.
## 🙏 Acknowledgments
- This project is built on top of ort (Rust bindings for ONNX Runtime). Special thanks to the ort maintainers.
- Special thanks to @kadu-v for the jamtrack-rs project, which inspired our ByteTracker implementation.
- Thanks to all the open-source libraries and their maintainers that make this project possible. See Cargo.toml for a complete list of dependencies.
## 📜 License

This project is licensed under the terms described in the LICENSE file.