# object_detector
Easy-to-use object detection and instance segmentation in Rust, powered by ONNX Runtime and the YOLOE-26 (Real-Time Seeing Anything) model family.

object_detector allows you to detect and segment virtually any object in an image. It supports two main modes:
Promptable (where you describe what to find in text) and Prompt-Free (where it uses a built-in vocabulary
of 4,500+ tags).
## Features
- Open-Vocabulary Detection: Describe objects using natural language (e.g., "blue toaster", "person with a hat").
- Built-in Vocabulary: Detect 4,585 different object categories out-of-the-box in Prompt-Free mode.
- Instance Segmentation: Get high-quality pixel-level masks for every detected object.
- Flexible Scaling: Choose from 5 model sizes (Nano to XLarge) to balance speed and accuracy.
- Automatic Model Management: Models can be automatically downloaded from Hugging Face on the first run.
- Hardware Acceleration: Full support for CUDA, TensorRT, CoreML, DirectML, and more via `ort`.
- Pure Rust API: Clean, builder-based interface for both initialization and inference.
## Usage
The snippet below is a minimal sketch of a promptable detection run. Item and method names (`ObjectDetector::builder()`, the `detect` inference builder, the result fields) are illustrative assumptions; consult the crate documentation for the exact API.
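```rust
use object_detector::{DetectorType, ObjectDetector}; // assumed re-export names
use std::path::Path;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // With the `hf-hub` feature enabled, the first run downloads the
    // default (Large) model from Hugging Face.
    let detector = ObjectDetector::builder() // assumed builder entry point
        .detector_type(DetectorType::Promptable)
        .build()
        .await?;

    // Describe what to find in natural language.
    let objects = detector
        .detect(Path::new("image.jpg"))                 // assumed inference builder
        .labels(&["blue toaster", "person with a hat"]) // assumed prompt setter
        .run()
        .await?;

    for obj in &objects {
        println!("{} ({:.2})", obj.tag, obj.score); // assumed field names
    }
    Ok(())
}
```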
### Configuring parameters
The following example demonstrates how to configure the detector using the builder pattern for both initialization and inference.
The sketch below combines initialization-time options (detector type, masks, execution provider) with per-call thresholds; the setter names are assumptions based on the parameters documented later in this README.
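```rust
use object_detector::{DetectorType, ObjectDetector}; // assumed re-export names
use ort::execution_providers::CUDAExecutionProvider; // path as in recent `ort` releases

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Initialization-time parameters.
    let detector = ObjectDetector::builder()                  // assumed builder entry point
        .detector_type(DetectorType::PromptFree)
        .include_mask(true)                                   // request instance masks
        .execution_provider(CUDAExecutionProvider::default()) // assumed setter name
        .build()
        .await?;

    // Inference-time parameters, tunable per call.
    let objects = detector
        .detect("image.jpg")          // assumed inference builder
        .confidence_threshold(0.4)    // default: 0.25
        .intersection_over_union(0.5) // default: 0.7
        .run()
        .await?;

    println!("{} objects found", objects.len());
    Ok(())
}
```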
## Advanced Usage
### Local Models
If you prefer to manage your own ONNX files or want to skip the Hugging Face integration, you can instantiate the detectors directly using local paths.
The sketch below assumes hypothetical constructor and setter names (`from_files`, `model_path`, `text_embedder`); the `TextEmbedder` is provided by `open_clip_inference`.
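```rust
use object_detector::{DetectorType, ObjectDetector}; // assumed re-export names
use open_clip_inference::TextEmbedder;               // CLIP text encoder for prompt embeddings

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // All paths and the constructor/setter names below are illustrative.
    let embedder = TextEmbedder::from_files("clip_text.onnx", "tokenizer.json")?; // assumed constructor

    let detector = ObjectDetector::builder()
        .detector_type(DetectorType::Promptable)
        .model_path("models/yoloe26-l-seg.onnx") // assumed setter for a local ONNX file
        .text_embedder(embedder)                 // assumed setter
        .build()
        .await?;

    let objects = detector.detect("image.jpg").labels(&["cat"]).run().await?;
    println!("{} objects found", objects.len());
    Ok(())
}
```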
### Generating ONNX files
If you don't want to use the ONNX files hosted on HuggingFace, you can generate them yourself.
```sh
uv run scripts/export_onnx.py
```
### Processing and Visualizing Results
Each `DetectedObject` contains a bounding box, a score, a class ID, a tag string, and an optional pixel-level mask.
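A sketch of consuming those results; the field names (`tag`, `score`, `class_id`, `bbox`, `mask`) are assumptions about the struct layout:

```rust
use object_detector::DetectedObject; // assumed re-export name

// Field names below are assumptions; the real struct exposes the same
// information (box, score, class ID, tag, optional mask), possibly renamed.
fn summarize(objects: &[DetectedObject]) {
    for obj in objects {
        println!(
            "{} (class {}) score {:.2} at [{:.0}, {:.0}, {:.0}, {:.0}]",
            obj.tag, obj.class_id, obj.score,
            obj.bbox.x1, obj.bbox.y1, obj.bbox.x2, obj.bbox.y2,
        );
        if obj.mask.is_some() {
            // Present only when the detector was built with include_mask(true).
            println!("  includes a pixel-level mask");
        }
    }
}
```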
For a complete visualization example that uses `imageproc` to draw boxes, text, and masks onto images, see `examples/visualize.rs`.
### Execution Providers (Nvidia, AMD, Intel, Mac, Arm, etc.)
To use hardware acceleration, you must enable the corresponding feature in your Cargo.toml. Then, pass the provider
during initialization:
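For example, with the `cuda` feature enabled, a sketch might look like this (the `execution_provider` setter name is an assumption; any provider supported by `ort` can be swapped in):

```rust
use object_detector::ObjectDetector;                 // assumed re-export name
use ort::execution_providers::CUDAExecutionProvider; // path as in recent `ort` releases

async fn build_with_cuda() -> Result<ObjectDetector, Box<dyn std::error::Error>> {
    Ok(ObjectDetector::builder()
        .execution_provider(CUDAExecutionProvider::default()) // assumed setter name
        .build()
        .await?)
}
```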
## Model Selection Guide
This crate utilizes YOLOE-26 (Real-Time Seeing Anything), a state-of-the-art open-vocabulary model family built upon the Ultralytics YOLO26 architecture. Unlike traditional YOLO models limited to a fixed set of categories (like COCO's 80 classes), YOLOE-26 can detect and segment virtually any object.
### Performance Benchmarks
The following results demonstrate the execution time (latency) across different scales and modes.

> [!NOTE]
> Benchmark Environment: Ryzen 5800X3D CPU | RTX 2080Ti GPU | CUDA Execution Provider. For Promptable modes, text embeddings (CLIP) are cached, as they would be when performing repeated inference during normal use.
### Model Scales: Speed vs. Accuracy
The crate supports five model scales. Choosing the right one depends on your hardware and accuracy requirements:
| Scale | Parameters | Accuracy (mAP) | Description | Best For |
|---|---|---|---|---|
| Nano (N) | ~4.8M | 41.0 | Fastest inference speed. | Edge devices, mobile applications, and low-power CPUs. |
| Small (S) | ~13.1M | 48.7 | Balanced efficiency and accuracy. | Real-time desktop applications and mid-range IoT devices. |
| Medium (M) | ~27.9M | 53.1 | High accuracy with moderate latency. | GPU inference where precision is a priority. |
| Large (L) | ~32.3M | 55.0 | (Default) High-fidelity detection. | Server-side processing and high-precision robotics. |
| XLarge (X) | ~69.9M | 57.5 | Maximum accuracy available. | Non-real-time analysis and maximum-precision tasks. |
> [!NOTE]
> Model accuracy numbers are taken from the YOLO26 docs.
### Operating Modes: Prompt-Free vs. Promptable
#### Prompt-Free Mode (`DetectorType::PromptFree`)
- Mechanism: Uses a built-in vocabulary of 4,585 classes based on the RAM++ tag set.
- Characteristics: This mode is highly efficient as it does not require external text encoding. It functions like a traditional YOLO model but with a much larger pre-defined label set.
- Constraint: You are limited to the built-in vocabulary; you cannot define custom classes at runtime in this mode.
#### Promptable Mode (`DetectorType::Promptable`)
- Mechanism: Uses a text-alignment module to compare image features against CLIP text embeddings.
- Characteristics: This mode provides high flexibility. You can prompt the model with specific strings such as `"blue toaster"` or `"peace symbol"`.
- Constraint: Requires a CLIP model (handled automatically via `open_clip_inference`) to generate embeddings for your labels, which adds a small initial overhead.
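A minimal sketch of selecting between the two modes (builder and variant names as assumed above):

```rust
use object_detector::{DetectorType, ObjectDetector}; // assumed re-export names

async fn pick_mode(custom_labels: bool) -> Result<ObjectDetector, Box<dyn std::error::Error>> {
    let detector_type = if custom_labels {
        DetectorType::Promptable // CLIP-encoded labels; maximum flexibility
    } else {
        DetectorType::PromptFree // built-in 4,585-class vocabulary; no text encoding
    };
    Ok(ObjectDetector::builder() // assumed builder entry point
        .detector_type(detector_type)
        .build()
        .await?)
}
```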
### Task Selection: Mask (Segmentation) vs. Detection
You can toggle the `include_mask(bool)` parameter during builder initialization (see the sketch after this list).
- Instance Segmentation (`include_mask(true)`): Includes a pixel-level `ObjectMask` for every detected object. This is useful for tasks like background removal, object isolation, or precise spatial analysis.
- Object Detection (`include_mask(false)`): Returns only the bounding boxes, tags, and scores for detected objects.
- Performance Impact:
  - On GPU, detecting only bounding boxes is approximately 15-33% faster than segmentation.
  - On CPU, the difference is more significant: inference runs up to 2x faster when masks are disabled.
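A sketch of the two configurations (builder names as assumed above):

```rust
use object_detector::ObjectDetector; // assumed re-export name

// Detection only: boxes, tags, and scores; the fastest option.
async fn detection_only() -> Result<ObjectDetector, Box<dyn std::error::Error>> {
    Ok(ObjectDetector::builder().include_mask(false).build().await?)
}

// Segmentation: additionally populates each object's pixel-level mask.
async fn with_masks() -> Result<ObjectDetector, Box<dyn std::error::Error>> {
    Ok(ObjectDetector::builder().include_mask(true).build().await?)
}
```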
### Understanding Thresholds
Fine-tuning the detection thresholds is essential for balancing precision (reducing false positives) and recall (finding all objects).
#### Confidence Threshold (`confidence_threshold`)
This parameter determines the minimum certainty required for a detection to be returned.
- Default: `0.25`
- Low values (e.g., 0.1): The model is more sensitive and detects more objects, but it also returns more "noise" (incorrect detections).
- High values (e.g., 0.7): The model only returns objects it is very certain about. This reduces false positives but may cause it to miss partially obscured or small objects.
#### Intersection Over Union (`intersection_over_union`)
This parameter controls the Non-Maximum Suppression (NMS) process, which decides how to handle multiple overlapping boxes for the same object.
- Default: `0.7`
- Mechanism: When two boxes overlap significantly, the IOU value measures the ratio of the overlap area to the combined (union) area of both boxes. If the IOU is higher than your threshold, the box with the lower confidence score is discarded (see the worked example after this list).
- High values (e.g., 0.8): More tolerant of overlapping boxes. Useful in crowded scenes where objects are physically close to each other.
- Low values (e.g., 0.3): More aggressive at removing overlaps. Useful if the model is producing "double" detections for a single object.
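To make the NMS criterion concrete, here is a small self-contained sketch of the IOU computation (corner-format boxes are an assumption for illustration, not the crate's actual bounding-box type):

```rust
/// Axis-aligned box in corner format: (x1, y1) top-left, (x2, y2) bottom-right.
struct BBox { x1: f32, y1: f32, x2: f32, y2: f32 }

/// Intersection over Union: overlap area divided by the union of both areas.
fn iou(a: &BBox, b: &BBox) -> f32 {
    let ix = (a.x2.min(b.x2) - a.x1.max(b.x1)).max(0.0);
    let iy = (a.y2.min(b.y2) - a.y1.max(b.y1)).max(0.0);
    let inter = ix * iy;
    let area_a = (a.x2 - a.x1) * (a.y2 - a.y1);
    let area_b = (b.x2 - b.x1) * (b.y2 - b.y1);
    inter / (area_a + area_b - inter)
}

fn main() {
    // Two heavily overlapping 10x10 boxes: IoU = 81 / 119 ≈ 0.68.
    // That is below the default threshold of 0.7, so NMS keeps both boxes;
    // at a threshold of 0.3, the lower-confidence box would be discarded.
    let a = BBox { x1: 0.0, y1: 0.0, x2: 10.0, y2: 10.0 };
    let b = BBox { x1: 1.0, y1: 1.0, x2: 11.0, y2: 11.0 };
    println!("IoU = {:.2}", iou(&a, &b));
}
```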
## Cargo Features
| Feature | Default | Description | Dependencies |
|---|---|---|---|
| `promptable` | Yes | Enable the Promptable detector and CLIP embeddings. | `open_clip_inference` |
| `hf-hub` | Yes | Enable automatic model downloading from Hugging Face. | `hf-hub`, `tokio` |
| `serde` | Yes | Enable Serialize/Deserialize for detection structs. | `serde` |
The main `ort` Cargo features are also forwarded:
- ORT Cargo features: https://ort.pyke.io/setup/cargo-features
- ORT Execution providers: https://ort.pyke.io/perf/execution-providers
## Troubleshooting
### Link error - ORT
If a link error occurs while building, it is most likely caused by ORT. Try the `load-dynamic` Cargo feature to resolve it; you will then need to point to an ONNX Runtime library on your system via an environment variable. See the next section for details.
### ONNX Runtime library not found (when using the `load-dynamic` feature)
With `load-dynamic`, ONNX Runtime is loaded at runtime instead of being linked at build time. If it cannot be found, download the correct onnxruntime library from GitHub Releases, then either put the dll/so/dylib location on your `PATH` or point the `ORT_DYLIB_PATH` environment variable at it (e.g. `export ORT_DYLIB_PATH=/path/to/libonnxruntime.so`).
## Output samples
Car park - no masks - prompted ["car"]

StreetView - with mask - prompt free

Cat - with mask - prompted ["cat"]
