§Next-Plaid ONNX
Fast ColBERT inference using ONNX Runtime with automatic hardware acceleration.
Also includes hierarchical clustering utilities compatible with scipy.
§Quick Start
```rust
use next_plaid_onnx::Colbert;

// Simple usage with defaults (auto-detects threads and hardware)
let model = Colbert::new("models/GTE-ModernColBERT-v1")?;

// Encode documents
let doc_embeddings = model.encode_documents(&["Paris is the capital of France."], None)?;

// Encode queries
let query_embeddings = model.encode_queries(&["What is the capital of France?"])?;
```
§Configuration
Use the builder pattern for advanced configuration:
```rust
use next_plaid_onnx::{Colbert, ExecutionProvider};

let model = Colbert::builder("models/GTE-ModernColBERT-v1")
    .with_quantized(true)                             // Use INT8 model for ~2x speedup
    .with_parallel(25)                                // 25 parallel ONNX sessions
    .with_batch_size(2)                               // Batch size per session
    .with_execution_provider(ExecutionProvider::Cuda) // Force CUDA
    .build()?;
```
§Hardware Acceleration
Enable GPU acceleration by adding the appropriate feature:
- `cuda` - NVIDIA CUDA (Linux/Windows)
- `tensorrt` - NVIDIA TensorRT (optimized CUDA)
- `coreml` - Apple Silicon (macOS)
- `directml` - Windows GPUs (DirectX 12)
When GPU features are enabled, the library automatically uses GPU if available and falls back to CPU if not.
Modules§
- `hierarchy` - Hierarchical clustering implementation compatible with `scipy.cluster.hierarchy`.
Structs§
- `Colbert` - ColBERT model for encoding documents and queries into multi-vector embeddings.
- `ColbertBuilder` - Builder for configuring `Colbert`.
- `ColbertConfig` - Configuration for ColBERT model behavior.
Enums§
- `ExecutionProvider` - Hardware acceleration provider for ONNX Runtime.
Functions§
- `is_cuda_available` - Check if the CUDA execution provider is available. Always returns false when the `cuda` feature is not enabled.
- `is_force_cpu` - Check if CPU-only mode is forced via environment variable. Set `NEXT_PLAID_FORCE_CPU=1` to completely disable CUDA and avoid any CUDA library-loading overhead.