usls 0.2.0-alpha.3

usls is a cross-platform Rust library powered by ONNX Runtime for efficient inference of SOTA vision and vision-language models (typically under 1B parameters).

🌟 Highlights

  • ⚡ High Performance: Multi-threading, SIMD, and CUDA-accelerated processing
  • ✨ Cross-Platform: Linux, macOS, Windows with ONNX Runtime execution providers (CUDA, TensorRT, CoreML, OpenVINO, DirectML, etc.)
  • 🎯 Precision Support: FP32, FP16, INT8, UINT8, Q4, Q4F16, BNB4, and more
  • 🛠️ Full-Stack Suite: DataLoader, Annotator, and Viewer for complete workflows
  • 🏗️ Unified API: A single Model trait with run()/forward()/encode_images()/encode_texts(), all returning the unified Y output
  • 📥 Auto-Management: Automatic model download (HuggingFace/GitHub), caching and path resolution
  • 📦 Multiple Inputs: Images, directories, videos, webcams, streams, and combinations thereof
  • 🌱 Model Ecosystem: 50+ SOTA vision and VLM models
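In code, the unified API means every model is driven the same way. The following is an illustrative sketch of that workflow only; the identifiers used here (YOLO::new, Options::yolo_detect, DataLoader::try_read_n, Annotator::annotate) are assumptions, so check the API reference for the exact constructors and signatures:

```rust
// Illustrative sketch only: type and method names are assumptions,
// not guaranteed to match the current usls API.
use usls::{models::YOLO, Annotator, DataLoader, Options};

fn main() -> anyhow::Result<()> {
    // Model files are downloaded and cached automatically (HuggingFace/GitHub).
    let mut model = YOLO::new(Options::yolo_detect().with_model_file("yolo26n.onnx"))?;

    // Inputs can be images, directories, videos, webcams, or streams.
    let xs = DataLoader::try_read_n(&["./assets/bus.jpg"])?;

    // Every model returns the unified `Y` output.
    let ys = model.forward(&xs)?;

    // Annotate and save the detections.
    let annotator = Annotator::default();
    for (x, y) in xs.iter().zip(ys.iter()) {
        annotator.annotate(x, y)?.save("./result.jpg")?;
    }
    Ok(())
}
```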

🚀 Quick Start

Run the YOLO-Series demo to explore models across different tasks, precisions, and execution providers:

  • Tasks: detect, segment, pose, classify, obb
  • Versions: v5, v6, v7, v8, v9, v10, 11, 12, v13, 26
  • Scales: n, s, m, l, x
  • Precision: fp32, fp16, q8, int8, q4, q4f16, bnb4, and more
  • Execution Providers: CPU, CUDA, TensorRT, TensorRT-RTX, CoreML, OpenVINO, and more
cargo run -r --example yolo -- --task detect --ver 26 --scale n --dtype fp16
cargo run -r -F cuda --example yolo -- --task segment --ver 11 --scale m --device cuda:0 --processor-device cuda:0
cargo run -r -F tensorrt-full --example yolo -- --device tensorrt:0 --processor-device cuda:0
cargo run -r -F nvrtx-full --example yolo -- --device nvrtx:0 --processor-device cuda:0
cargo run -r -F coreml --example yolo -- --device coreml
cargo run -r -F openvino -F ort-load-dynamic --example yolo -- --device openvino:CPU

Environment: NVIDIA RTX 3060Ti (TensorRT-10.11.0.33, CUDA 12.8, TensorRT-RTX-1.3.0.35) / Intel i5-12400F

Setup: YOLO26 Detection, COCO2017-val (5,000 images), 640x640, Conf thresholds: [0.35, 0.3, ..]

Results are for rough reference only.

Scale  EP            ImageProcessor  DType  Batch  Preprocess  Inference  Postprocess  Total
n      TensorRT      CUDA            FP16   1      ~233µs      ~1.3ms     ~14µs        ~1.55ms
n      TensorRT-RTX  CUDA            FP32   1      ~233µs      ~2.0ms     ~10µs        ~2.24ms
n      TensorRT-RTX  CUDA            FP16   1
n      CUDA          CUDA            FP32   1      ~233µs      ~5.0ms     ~17µs        ~5.25ms
n      CUDA          CUDA            FP16   1      ~233µs      ~3.6ms     ~17µs        ~3.85ms
n      CUDA          CPU             FP32   1      ~800µs      ~6.5ms     ~14µs        ~7.31ms
n      CUDA          CPU             FP16   1      ~800µs      ~5.0ms     ~14µs        ~5.81ms
n      CPU           CPU             FP32   1      ~970µs      ~20.5ms    ~14µs        ~21.48ms
n      CPU           CPU             FP16   1      ~970µs      ~25.0ms    ~14µs        ~25.98ms
n      TensorRT      CUDA            FP16   8      ~1.2ms      ~6.0ms     ~55µs        ~7.26ms
n      TensorRT      CPU             FP16   8      ~18.0ms     ~25.5ms    ~55µs        ~43.56ms
m      TensorRT      CUDA            FP16   1      ~233µs      ~3.6ms     ~14µs        ~3.85ms
m      TensorRT      CUDA            Int8   1      ~233µs      ~2.6ms     ~14µs        ~2.84ms
m      CUDA          CUDA            FP32   1      ~233µs      ~16.1ms    ~17µs        ~16.35ms
m      CUDA          CUDA            FP16   1      ~233µs      ~8.8ms     ~17µs        ~9.05ms
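The Total column is roughly the sum of the three per-stage timings (preprocess and postprocess are reported in µs, inference in ms). A quick sanity check of the first row (n / TensorRT / FP16 / batch 1), with the helper name chosen here for illustration:

```rust
// Total latency = preprocess + inference + postprocess, normalized to ms.
fn total_ms(preprocess_us: f64, inference_ms: f64, postprocess_us: f64) -> f64 {
    preprocess_us / 1000.0 + inference_ms + postprocess_us / 1000.0
}

fn main() {
    // First row of the table: ~233µs + ~1.3ms + ~14µs.
    let t = total_ms(233.0, 1.3, 14.0);
    println!("total ≈ {t:.2}ms"); // ≈1.55ms, matching the table
}
```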

What's Next?

  • 📖 Online Documentation
  • 📚 API Reference
  • 🚀 Examples

🤝 Contributing

This is a personal project maintained in spare time, so progress on performance optimization and new model support may vary.

PRs for model optimization are especially welcome: if you have expertise in a specific model and can improve its interface or post-processing, your contribution would be invaluable. Feel free to open an issue or submit a pull request with suggestions, bug reports, or feature requests.

🙏 Acknowledgments

Thanks to all the open-source libraries and their maintainers that make this project possible. See Cargo.toml for a complete list of dependencies.

📜 License

This project is licensed under the terms described in the LICENSE file.