# NumKong for Rust
NumKong's Rust crate keeps most of the native kernel surface while expressing it in Rust-native terms.
Rust is the cleanest place to use NumKong when you want static typing, explicit ownership, and strong container APIs without giving up mixed precision.
Traits cover scalar metric families.
Tensor, Vector, and packed matrix types cover the higher-level workflows.
Custom allocators, low-precision storage wrappers, and explicit row-contiguity checks stay visible instead of being hidden behind a dynamic runtime.
The crate makes the storage policy and result promotion visible.
That matters for fp16, bf16, fp8, packed bits, and strided reductions.
## Quickstart

```rust
use numkong::Dot;

let a = vec![1.0f32, 2.0, 3.0];
let b = vec![4.0f32, 5.0, 6.0];
let dot = f32::dot(&a, &b).unwrap();
println!("dot = {dot}");
```
## Highlights

This is the most complete high-level SDK after Python. It is the best fit if you want most of the native breadth without dropping into a manual FFI layer.

- **Trait-first scalar API.** `Type::operation(&a, &b)` stays compact and predictable.
- **Allocator-aware tensors.** `Tensor`, `PackedMatrix`, and `MaxSimPackedMatrix` can use custom allocators.
- **Storage-first low precision.** `f16`, `bf16`, fp8, fp6, and packed integer wrappers are first-class types.
- **Matrix kernels with explicit contracts.** Packed and symmetric kernels validate shapes and row contiguity.
- **No hidden thread pool.** Parallel helpers remain host-controlled.
- **Fork Union support.** The `parallel` feature is the intended native orchestration layer.
## Ecosystem Comparison
| Feature | NumKong | nalgebra | ndarray |
|---|---|---|---|
| Operation families | dots, distances, binary, probability, geospatial, curved, mesh, sparse, MaxSim, elementwise, reductions, cast, trig | linear algebra, decompositions | general n-dimensional arithmetic |
| Precision | BFloat16 through sub-byte; automatic widening; Kahan summation; 0 ULP in Float32/Float64 | Float32/Float64 only; no widening; standard accuracy | Float32/Float64 only; no widening; standard accuracy |
| Runtime SIMD dispatch | auto-selects best ISA per-thread at runtime across x86, ARM, RISC-V | none | none |
| Packed matrix, GEMM-like | pack once, reuse across query batches | standard matmul; no persistent packing | dot for matmul; no persistent packing |
| Symmetric kernels, SYRK-like | skip duplicate pairs, up to 2x speedup for self-distance | no duplicate-pair skipping | no duplicate-pair skipping |
| Memory model | caller-owned; Tensor/PackedMatrix support custom allocators | heap-allocated matrices; custom storage trait | heap-allocated; no custom allocator support |
| Host-side parallelism | row-range partitioning via reusable ThreadPool; no hidden threads | Rayon-based parallelism possible | Rayon-based parallelism possible |
NumKong validates f16 and bf16 interop against the half crate in its own test suite.
That lets you move between ecosystem-standard half types and NumKong's kernel-facing wrappers without ambiguity.
## Installation

Minimal:

```toml
[dependencies]
numkong = "6.5.15"
```

With host-side parallel helpers:

```toml
[dependencies]
numkong = { version = "6.5.15", features = ["parallel", "std"] }
```
## Compilation and Backend Selection
The crate's build script uses the `cc` crate to compile the C backend with `NK_DYNAMIC_DISPATCH=1` automatically.
All supported backends for the target architecture are compiled into a single binary and selected at runtime.
The two Cargo features are `std`, which enables standard library support, and `parallel`, which adds host-side orchestration via Fork Union and implies `std`.
Backend selection follows the target architecture. ARM gets NEON, SVE, and SME, with SME available on Linux, FreeBSD, and macOS. x86-64 gets Haswell (AVX2), Skylake/Icelake/Sapphire Rapids AVX-512 variants, and AMX on Linux and Windows only. RISC-V gets RVV backends on Linux and FreeBSD. WASM gets relaxed v128.
Individual backends can be disabled through environment variables.
Any NK_TARGET_* variable set to 0 or false disables that backend.
Backends not explicitly disabled are enabled by default for the target platform.
```sh
# Disable NEON on ARM builds
NK_TARGET_NEON=0 cargo build
# Disable SVE and SME together
NK_TARGET_SVE=0 NK_TARGET_SME=0 cargo build
```
If a backend fails to compile, the build system automatically disables it and retries with the remaining backends. A warning is emitted for each disabled backend.
## Dynamic Dispatch and Capabilities
configure_thread configures rounding behavior and enables CPU-specific acceleration features such as Intel AMX.
It must be called once per thread before using any SIMD-accelerated operations.
```rust
// Import paths are illustrative; the names come from the capabilities API
use numkong::{available, configure_thread, SAPPHIREAMX};

let caps = available();
configure_thread();
if caps & SAPPHIREAMX != 0 {
    println!("Intel AMX is available");
}
```
Call configure_thread at the start of every thread that will invoke NumKong kernels.
In a thread-pool setting, each worker thread needs its own call.
The function is idempotent and cheap to call more than once on the same thread.
## Core Traits
The crate root re-exports the main metric families:
`Dot`, `VDot`, `Angular`, `Euclidean`, `Hamming`, `Jaccard`, `KullbackLeibler`, `JensenShannon`, `Haversine`, `Vincenty`, `Bilinear`, `Mahalanobis`, `ReduceMoments`, `ReduceMinMax`, `EachScale`, `EachSum`, `EachBlend`, and `EachFMA`.
The standard call shape is:
```rust
use numkong::{Dot, Jaccard, JensenShannon};

let a = vec![1.0f32, 2.0, 3.0];
let b = vec![4.0f32, 5.0, 6.0];
let dot = f32::dot(&a, &b).unwrap();

// Packed binary vectors: one u8 word carries eight bits
let bits_a = vec![0b1010_1010u8; 16];
let bits_b = vec![0b0101_0101u8; 16];
let jaccard = u8::jaccard(&bits_a, &bits_b).unwrap();

let p = vec![0.25f32, 0.25, 0.5];
let q = vec![0.5f32, 0.25, 0.25];
let jsd = f32::jensenshannon(&p, &q).unwrap();
println!("dot = {dot}, jaccard = {jaccard}, jsd = {jsd}");
```
## Dot Products
Dot products span real, complex, quantized, and packed storage types.
```rust
use numkong::{Dot, VDot};

// Complex values interleaved as [re0, im0, re1, im1, ...]
let a = vec![1.0f32, 2.0, 3.0, 4.0];
let b = vec![5.0f32, 6.0, 7.0, 8.0];
let dot = f32::dot(&a, &b).unwrap();
let vdot = f32::vdot(&a, &b).unwrap(); // like numpy.vdot, conjugated
println!("dot = {dot:?}, vdot = {vdot:?}");
```
## Dense Distances
The dense spatial family covers sqeuclidean, euclidean, and angular.
The main value over naive loops is the combination of SIMD and safer accumulation policy.
```rust
use numkong::Euclidean;

let a: Vec<i8> = vec![1, 2, 3];
let b: Vec<i8> = vec![4, 5, 6];
let distance = i8::euclidean(&a, &b).unwrap(); // widened output, not int8
println!("euclidean = {distance}");
```
## Scalar Types and Promotions
The scalar wrappers are storage-first types.
They are not decorative aliases over f32.
| Type | Layout | Bytes | Range | Inf | NaN |
|---|---|---|---|---|---|
| `f16` | 1+5+10 | 2 | ±65504 | yes | yes |
| `bf16` | 1+8+7 | 2 | ±3.4×10³⁸ | yes | yes |
| `e4m3` | 1+4+3 | 1 | ±448 | no | yes |
| `e5m2` | 1+5+2 | 1 | ±57344 | yes | yes |
| `e2m3` | 1+2+3 (6 bit) | 1 | ±7.5 | no | no |
| `e3m2` | 1+3+2 (6 bit) | 1 | ±28 | no | no |
| `u1x8` | 8 packed bits | 1 | 0–1 per bit | — | — |
| `u4x2` | 2×4-bit uint | 1 | 0–15 per nibble | — | — |
| `i4x2` | 2×4-bit int | 1 | −8–7 per nibble | — | — |
The trait hierarchy documents intent:
- `StorageElement` — raw storable element type.
- `NumberLike` — adds numeric conversion and ordering.
- `FloatConvertible` — adds unpacking and float-domain conversion.
The output type is intentionally wider than the storage type for many operations.
For example, i8::dot returns i32.
f32::dot returns a wider accumulator type.
Moments reductions widen even more aggressively.
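To make the widening contract concrete, here is a plain-Rust sketch of what the rule buys. The function name `dot_i8_widened` is illustrative and not part of the crate; it only mirrors the documented promise that `i8::dot` returns `i32`.

```rust
// An i8 dot product must accumulate in i32: products of two i8 values
// already exceed i8 range, and sums of many products exceed i16.
fn dot_i8_widened(a: &[i8], b: &[i8]) -> i32 {
    a.iter().zip(b).map(|(&x, &y)| x as i32 * y as i32).sum()
}

fn main() {
    let a = [100i8; 64];
    let b = [100i8; 64];
    // 64 * 100 * 100 = 640_000 overflows i8 and i16, but fits i32
    assert_eq!(dot_i8_widened(&a, &b), 640_000);
}
```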
## Set Similarity
Packed-binary metrics work on packed words instead of boolean slices. That is the right model once the workload is "semantic hash" rather than "array of booleans".
```rust
use numkong::{Hamming, Jaccard};

// 1024 bits packed into 128 bytes per vector
let a = vec![0b1010_1010u8; 128];
let b = vec![0b0101_0101u8; 128];
let hamming = u8::hamming(&a, &b).unwrap();
let jaccard = u8::jaccard(&a, &b).unwrap();
println!("hamming = {hamming}, jaccard = {jaccard}");
```
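For intuition, the same packed-word metrics can be written as a scalar baseline with `count_ones`. The function names below are illustrative, not NumKong's API, and `jaccard_packed` returns a distance (one minus the similarity) under that assumption.

```rust
// Hamming over packed words: XOR, then popcount
fn hamming_packed(a: &[u8], b: &[u8]) -> u32 {
    a.iter().zip(b).map(|(&x, &y)| (x ^ y).count_ones()).sum()
}

// Jaccard distance over packed words: 1 − |A ∩ B| / |A ∪ B| on bits
fn jaccard_packed(a: &[u8], b: &[u8]) -> f64 {
    let inter: u32 = a.iter().zip(b).map(|(&x, &y)| (x & y).count_ones()).sum();
    let union: u32 = a.iter().zip(b).map(|(&x, &y)| (x | y).count_ones()).sum();
    if union == 0 { 0.0 } else { 1.0 - inter as f64 / union as f64 }
}

fn main() {
    let a = [0b1010_1010u8; 4];
    let b = [0b0101_0101u8; 4];
    assert_eq!(hamming_packed(&a, &b), 32); // all 32 bits differ
    assert_eq!(jaccard_packed(&a, &b), 1.0); // no overlapping bits at all
}
```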
Integer set Jaccard works on sorted arrays of integer identifiers.
```rust
let set_a = vec![10u32, 20, 30, 50];
let set_b = vec![20u32, 30, 40, 60];
let jaccard_sets = u32::jaccard(&set_a, &set_b).unwrap();
assert!(jaccard_sets > 0.0); // |A ∩ B| / |A ∪ B|
```
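The scalar shape of that work is a two-pointer walk over the two sorted ID arrays. The sketch below is illustrative, not NumKong's implementation, and computes the similarity |A ∩ B| / |A ∪ B| directly.

```rust
// Merge-style intersection count over sorted, deduplicated ID arrays
fn jaccard_sorted_sets(a: &[u32], b: &[u32]) -> f64 {
    let (mut i, mut j, mut inter) = (0usize, 0usize, 0usize);
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            std::cmp::Ordering::Less => i += 1,
            std::cmp::Ordering::Greater => j += 1,
            std::cmp::Ordering::Equal => { inter += 1; i += 1; j += 1; }
        }
    }
    let union = a.len() + b.len() - inter;
    inter as f64 / union as f64
}

fn main() {
    let set_a = [10u32, 20, 30, 50];
    let set_b = [20u32, 30, 40, 60];
    // |A ∩ B| = 2 (20 and 30), |A ∪ B| = 6
    assert!((jaccard_sorted_sets(&set_a, &set_b) - 2.0 / 6.0).abs() < 1e-12);
}
```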
## Probability Metrics
```rust
use numkong::{JensenShannon, KullbackLeibler};

let p = vec![0.6f32, 0.3, 0.1];
let q = vec![0.3f32, 0.4, 0.3];
let kl_forward = f32::kullbackleibler(&p, &q).unwrap();
let kl_reverse = f32::kullbackleibler(&q, &p).unwrap();
assert!((kl_forward - kl_reverse).abs() > 1e-3); // KLD is asymmetric

let js_forward = f32::jensenshannon(&p, &q).unwrap();
let js_reverse = f32::jensenshannon(&q, &p).unwrap();
assert!((js_forward - js_reverse).abs() < 1e-6); // JSD is symmetric
```
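The asymmetry and symmetry claims are easy to verify against the textbook definitions. The sketch below is a reference implementation in plain Rust, not NumKong code, with JSD built from two KL terms against the midpoint distribution.

```rust
// KL(P‖Q) = Σ pᵢ ln(pᵢ / qᵢ), with the usual 0·ln 0 = 0 convention
fn kl(p: &[f64], q: &[f64]) -> f64 {
    p.iter()
        .zip(q)
        .map(|(&pi, &qi)| if pi > 0.0 { pi * (pi / qi).ln() } else { 0.0 })
        .sum()
}

// JSD(P, Q) = ½ KL(P‖M) + ½ KL(Q‖M) with M = ½ (P + Q)
fn js(p: &[f64], q: &[f64]) -> f64 {
    let m: Vec<f64> = p.iter().zip(q).map(|(&pi, &qi)| 0.5 * (pi + qi)).collect();
    0.5 * kl(p, &m) + 0.5 * kl(q, &m)
}

fn main() {
    let p = [0.5, 0.5];
    let q = [0.9, 0.1];
    assert!((kl(&p, &q) - kl(&q, &p)).abs() > 1e-6); // asymmetric
    assert!((js(&p, &q) - js(&q, &p)).abs() < 1e-12); // symmetric
}
```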
## Geospatial Metrics
Inputs are latitudes and longitudes in radians. Outputs are meters.
```rust
use numkong::{Haversine, Vincenty};

// Statue of Liberty (40.6892°N, 74.0445°W) → Big Ben (51.5007°N, 0.1246°W)
// The out-parameter call shape shown here is illustrative
let liberty_lat = 40.6892f64.to_radians();
let liberty_lon = (-74.0445f64).to_radians();
let big_ben_lat = 51.5007f64.to_radians();
let big_ben_lon = (-0.1246f64).to_radians();
let mut distance = 0.0f64;
f64::vincenty(liberty_lat, liberty_lon, big_ben_lat, big_ben_lon, &mut distance).unwrap(); // ≈ 5,589,857 m (ellipsoidal, baseline)
f64::haversine(liberty_lat, liberty_lon, big_ben_lat, big_ben_lon, &mut distance).unwrap(); // ≈ 5,543,723 m (spherical, ~46 km less)

// Vincenty in f32 — drifts ~2 m from f64
let liberty_lat32 = liberty_lat as f32;
let liberty_lon32 = liberty_lon as f32;
let big_ben_lat32 = big_ben_lat as f32;
let big_ben_lon32 = big_ben_lon as f32;
let mut distance_f32 = 0.0f32;
f32::vincenty(liberty_lat32, liberty_lon32, big_ben_lat32, big_ben_lon32, &mut distance_f32).unwrap(); // ≈ 5,589,859 m (+2 m drift)
```
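As a sanity reference for the spherical path, the textbook haversine formula fits in a few lines. This is not NumKong's implementation; the 6,371 km mean Earth radius is an assumption of the sketch.

```rust
// Great-circle distance on a sphere, inputs in radians, output in meters
fn haversine_m(lat1: f64, lon1: f64, lat2: f64, lon2: f64) -> f64 {
    const R: f64 = 6_371_000.0; // mean Earth radius (assumption)
    let (dlat, dlon) = (lat2 - lat1, lon2 - lon1);
    let a = (dlat / 2.0).sin().powi(2)
        + lat1.cos() * lat2.cos() * (dlon / 2.0).sin().powi(2);
    2.0 * R * a.sqrt().asin()
}

fn main() {
    // One degree of latitude along a meridian is about 111.19 km
    let d = haversine_m(0.0, 0.0, 1f64.to_radians(), 0.0);
    assert!((d - 111_195.0).abs() < 10.0);
}
```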
## Curved Metrics
Curved-space kernels combine vectors with an extra metric tensor or covariance inverse.
```rust
use numkong::{Bilinear, Mahalanobis};

// Complex bilinear form: aᴴ M b (values interleaved as [re, im, ...])
let a = vec![1.0f32, 0.0, 2.0, 0.0];
let b = vec![3.0f32, 0.0, 4.0, 0.0];
let metric = vec![1.0f32, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]; // 2×2 complex identity
let bilinear = f32::bilinear(&a, &b, &metric).unwrap();
println!("bilinear = {bilinear:?}");

// Real Mahalanobis distance: √((a−b)ᵀ M⁻¹ (a−b))
let x = vec![1.0f32; 32];
let y = vec![2.0f32; 32];
let mut inv_cov = vec![0.0f32; 32 * 32];
for i in 0..32 {
    inv_cov[i * 32 + i] = 1.0; // identity matrix
}
let distance = f32::mahalanobis(&x, &y, &inv_cov).unwrap();
println!("mahalanobis = {distance}");
```
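A useful sanity check: with an identity inverse covariance, Mahalanobis distance collapses to plain Euclidean distance. The naive quadratic-form implementation below is illustrative, not NumKong's kernel.

```rust
// √((x−y)ᵀ M⁻¹ (x−y)) with M⁻¹ stored row-major in inv_cov
fn mahalanobis(x: &[f64], y: &[f64], inv_cov: &[f64]) -> f64 {
    let n = x.len();
    let d: Vec<f64> = x.iter().zip(y).map(|(&a, &b)| a - b).collect();
    let mut acc = 0.0;
    for i in 0..n {
        for j in 0..n {
            acc += d[i] * inv_cov[i * n + j] * d[j];
        }
    }
    acc.sqrt()
}

fn main() {
    let n = 4;
    let x = vec![1.0; 4];
    let y = vec![3.0; 4];
    let mut inv_cov = vec![0.0; n * n];
    for i in 0..n {
        inv_cov[i * n + i] = 1.0; // identity
    }
    // Identity M⁻¹ gives √(4 · 2²) = 4, the Euclidean distance
    assert!((mahalanobis(&x, &y, &inv_cov) - 4.0).abs() < 1e-12);
}
```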
## Vectors, Tensors, Views, and Spans
The container model is unusual enough that it needs direct documentation.
- `Vector<T>` owns one-dimensional storage.
- `VectorView<'a, T>` is an immutable borrowed view.
- `VectorSpan<'a, T>` is a mutable borrowed view.
- `Tensor<T, A, MAX_RANK>` owns N-dimensional storage and can use a custom allocator.
- `TensorView` and `TensorSpan` are the borrowed forms.
- `Matrix<T>` is a rank-2 alias over `Tensor<T, _, 2>`.
The allocator story is explicit.
Tensor and PackedMatrix default to Global.
The underlying layout uses SIMD_ALIGNMENT == 64 for owned allocations.
That does not mean callers must align their source buffers manually.
It means owned outputs and packed payloads are allocated in a SIMD-friendly way when the crate owns them.
```rust
use numkong::Tensor;

// 2×3 row-major tensor: [[0, 1, 2], [3, 4, 5]] (argument shapes illustrative)
let t = Tensor::<f32>::try_from_slice(&[0.0, 1.0, 2.0, 3.0, 4.0, 5.0], &[2, 3]).unwrap();
assert_eq!(t.shape(), &[2, 3]);
assert_eq!(t.rank(), 2);

// Transpose
let view = t.view().transpose();
assert_eq!(view.shape(), &[3, 2]);

// Column slicing — produces a strided view
let col1 = t.view().slice(1, 1).unwrap();
assert_eq!(col1.shape(), &[2]);

// Scalar element access
let val = t.view().get(&[1, 2]).unwrap(); // row 1, col 2 → 5.0
println!("val = {val}");

// Reduction on a sliced view
let idx = col1.try_argmin_all().unwrap(); // index of the minimum in the second column
println!("argmin = {idx}");
```
The main layout rules are:
- General slicing and transposition are supported by views.
- Elementwise and many reduction kernels accept strided views.
- Matrix-style kernels require rank-2 inputs with contiguous rows.
- A tensor can be non-contiguous overall and still have contiguous rows.
- Some reductions have SIMD kernels for strided lanes.
- Some backends still fall back depending on alignment and dtype.
## Elementwise Operations
Elementwise kernels live on tensors and views. They are not a promise that every arbitrary strided view gets the same SIMD path on every backend.
```rust
use numkong::Tensor;

let a = Tensor::<f32>::try_from_slice(&[0.0, 1.0, 2.0, 3.0], &[4]).unwrap();
let b = Tensor::<f32>::try_full(&[4], 2.0).unwrap();
let blended = a.view().try_blend_tensor(&b.view(), 0.5, 0.5).unwrap(); // 0.5·a + 0.5·b (weights illustrative)
let sines = blended.sin().unwrap();
assert_eq!(sines.shape(), &[4]);
```
## Trigonometry
The trigonometric kernels share the tensor and view surface. They are useful both directly and as a sanity check that the container path is not just about matrix kernels.
```rust
use numkong::Tensor;

let a = Tensor::<f32>::try_from_slice(&[0.0, 0.5, 1.0, 1.5], &[4]).unwrap();
let c = a.cos().unwrap();
let s = a.sin().unwrap();
assert_eq!(c.shape(), a.shape());
assert_eq!(s.shape(), a.shape());
```
## Moments Reductions
Moments reductions return both sum and sum-of-squares. That is the right building block for norms and variance-like workflows.
```rust
use numkong::Tensor;

let narrow = Tensor::<u8>::try_full(&[4096], 255).unwrap();
let (sum, sumsq) = narrow.try_moments_all().unwrap();
assert!(sum > u8::MAX as u64);   // a naive u8 accumulation would overflow immediately
assert!(sumsq > u8::MAX as u64); // same for sum-of-squares
```
The important documentation point is not just "wider outputs exist". It is that the API makes the widened outputs part of the type story.
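The widened-accumulator story in plain Rust looks like this. The function name `moments_u8` is illustrative, not NumKong's API; it only demonstrates why `u8` inputs need a `u64`-class accumulator.

```rust
// Sum and sum-of-squares of u8 data, accumulated in u64
fn moments_u8(data: &[u8]) -> (u64, u64) {
    data.iter().fold((0u64, 0u64), |(s, sq), &v| {
        let v = v as u64;
        (s + v, sq + v * v)
    })
}

fn main() {
    let narrow = vec![255u8; 4096];
    let (sum, sumsq) = moments_u8(&narrow);
    assert_eq!(sum, 1_044_480);      // 255 · 4096, far beyond u8::MAX
    assert_eq!(sumsq, 266_342_400);  // 255² · 4096, beyond u32 for longer inputs
}
```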
## Min/Max Reductions
Min/max reductions are a separate family because they stress layout differently. They also expose the unusual strength of NumKong's reduction backends: some strided lanes still hit SIMD kernels.
```rust
use numkong::Tensor;

// 3×2 row-major tensor; the second column holds [7.0, 9.0, 2.0]
let t = Tensor::<f32>::try_from_slice(&[3.0, 7.0, 1.0, 9.0, 4.0, 2.0], &[3, 2]).unwrap();
let second_column = t.view().slice(1, 1).unwrap();
let idx = second_column.try_argmin_all().unwrap();
assert_eq!(idx, 2); // the minimum, 2.0, sits in the last row
```
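What a strided-lane reduction actually does is walk one column of a row-major buffer with a stride equal to the row width. The scalar baseline below is illustrative, not NumKong's kernel.

```rust
// Argmin over data[offset], data[offset + stride], data[offset + 2·stride], ...
fn argmin_strided(data: &[f32], offset: usize, stride: usize) -> usize {
    let mut best = (0usize, f32::INFINITY);
    for (k, idx) in (offset..data.len()).step_by(stride).enumerate() {
        if data[idx] < best.1 {
            best = (k, data[idx]);
        }
    }
    best.0
}

fn main() {
    // 3×2 row-major matrix; the second column is [7.0, 9.0, 2.0]
    let t = [3.0f32, 7.0, 1.0, 9.0, 4.0, 2.0];
    assert_eq!(argmin_strided(&t, 1, 2), 2); // the minimum 2.0 sits in row 2
}
```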
## Sparse Operations and Intersections
Sparse helpers cover both sorted-index intersection and weighted sparse dot products.
```rust
use numkong::Sparse; // trait name illustrative

let a_idx = vec![1u32, 3, 5, 7];
let b_idx = vec![3u32, 5, 9];
let count = u32::sparse_intersection_size(&a_idx, &b_idx);
assert_eq!(count, 2); // indices 3 and 5

let a_weights = vec![1.0f32, 2.0, 3.0, 4.0];
let b_weights = vec![0.5f32, 1.5, 2.5];
let dot = u32::sparse_dot(&a_idx, &a_weights, &b_idx, &b_weights).unwrap();
assert!(dot > 0.0); // weighted dot over shared indices
```
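The scalar baseline the SIMD kernel replaces is a two-pointer walk that multiplies weights only where indices match. The sketch below is illustrative, not NumKong's implementation.

```rust
// Weighted sparse dot product over sorted index arrays
fn sparse_dot(a_idx: &[u32], a_w: &[f32], b_idx: &[u32], b_w: &[f32]) -> f32 {
    let (mut i, mut j, mut acc) = (0usize, 0usize, 0.0f32);
    while i < a_idx.len() && j < b_idx.len() {
        match a_idx[i].cmp(&b_idx[j]) {
            std::cmp::Ordering::Less => i += 1,
            std::cmp::Ordering::Greater => j += 1,
            std::cmp::Ordering::Equal => {
                acc += a_w[i] * b_w[j];
                i += 1;
                j += 1;
            }
        }
    }
    acc
}

fn main() {
    let (a_idx, a_w) = ([1u32, 3, 5, 7], [1.0f32, 2.0, 3.0, 4.0]);
    let (b_idx, b_w) = ([3u32, 5, 9], [0.5f32, 1.5, 2.5]);
    // Shared indices 3 and 5: 2.0·0.5 + 3.0·1.5 = 5.5
    assert!((sparse_dot(&a_idx, &a_w, &b_idx, &b_w) - 5.5).abs() < 1e-6);
}
```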
## Packed Matrix Kernels for GEMM-Like Workloads
Packed kernels are the main "matrix throughput" path in the crate.
They are GEMM-like in workload shape.
They are not a thin BLAS clone.
```rust
use numkong::{PackedMatrix, Tensor};

let a = Tensor::<f32>::try_full(&[128, 64], 1.0).unwrap();
let b = Tensor::<f32>::try_full(&[256, 64], 1.0).unwrap();
let b_packed = PackedMatrix::try_pack(&b.view()).unwrap();
let c = a.view().dots_packed(&b_packed).unwrap();
assert_eq!(c.shape(), &[128, 256]);
```

The useful economics are:

- pack `B` once and reuse it across many `A` batches
- convert or pad once during packing instead of on every multiply
- reuse precomputed norms for `angulars_packed` and `euclideans_packed`
The crate checks row contiguity because these kernels assume contiguous rows. Caller-side source alignment is not required. The owned packed buffer handles its own aligned allocation internally.
## Symmetric Kernels for SYRK-Like Workloads
Symmetric kernels are for self-similarity and self-distance.
They are SYRK-like in shape.
They avoid duplicate (i, j) and (j, i) work.
```rust
use numkong::Tensor;

let vectors = Tensor::<f32>::try_full(&[64, 32], 1.0).unwrap();
let gram = vectors.view().try_dots_symmetric().unwrap();
assert_eq!(gram.shape(), &[64, 64]);
```
This family is also where row-window partitioning becomes the natural parallel model.
That is structurally different from packed GEMM-style work against a shared packed RHS.
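The source of the "up to 2x" claim is visible in a scalar sketch: compute each (i, j) pair once over the upper triangle and mirror it, instead of recomputing (j, i). This is an illustration of the idea, not NumKong's kernel.

```rust
// Gram matrix of row vectors, computing only the upper triangle
fn gram_symmetric(rows: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let n = rows.len();
    let mut out = vec![vec![0.0f32; n]; n];
    for i in 0..n {
        for j in i..n { // roughly half the dot products of the naive loop
            let d: f32 = rows[i].iter().zip(&rows[j]).map(|(a, b)| a * b).sum();
            out[i][j] = d;
            out[j][i] = d; // mirror into the lower triangle
        }
    }
    out
}

fn main() {
    let rows = vec![vec![1.0, 0.0], vec![1.0, 1.0], vec![0.0, 2.0]];
    let g = gram_symmetric(&rows);
    assert_eq!(g[0][1], g[1][0]); // symmetric by construction
    assert_eq!(g[1][2], 2.0);     // [1, 1] · [0, 2]
}
```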
## MaxSim and ColBERT-Style Late Interaction
MaxSim is the late-interaction primitive used by systems such as ColBERT. It is not "just another matrix multiply".
```rust
use numkong::Tensor;

let queries = Tensor::<f32>::try_full(&[32, 128], 0.1).unwrap();
let docs = Tensor::<f32>::try_full(&[100, 128], 0.1).unwrap();
let queries_packed = queries.view().try_maxsim_pack().unwrap();
let docs_packed = docs.view().try_maxsim_pack().unwrap();
let score = queries_packed.score(&docs_packed);
assert!(score > 0.0);
```
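The scoring rule itself is short: for each query token, take the best-matching document token, then sum those maxima. The naive version below shows what the packed kernels accelerate; it is illustrative, not NumKong's API.

```rust
// MaxSim(Q, D) = Σ over query tokens of max over doc tokens of ⟨q, d⟩
fn maxsim(queries: &[Vec<f32>], docs: &[Vec<f32>]) -> f32 {
    queries
        .iter()
        .map(|q| {
            docs.iter()
                .map(|d| q.iter().zip(d).map(|(a, b)| a * b).sum::<f32>())
                .fold(f32::NEG_INFINITY, f32::max)
        })
        .sum()
}

fn main() {
    let queries = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    let docs = vec![vec![0.9, 0.1], vec![0.2, 0.8]];
    // First query best-matches doc 0 (0.9), second best-matches doc 1 (0.8)
    assert!((maxsim(&queries, &docs) - 1.7).abs() < 1e-6);
}
```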
## Geometric Mesh Alignment
Mesh alignment returns transforms, scales, and RMSD values. That is a different API shape from the scalar metric families.
```rust
use numkong::MeshAlignment;

// Point clouds as flat [x0, y0, z0, x1, y1, z1, ...] arrays (layout illustrative)
let source = vec![0.0f32, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0];
let target = vec![0.0f32, 0.0, 0.0, 0.0, 1.0, 0.0, -1.0, 0.0, 0.0]; // same cloud, rotated 90°
let result = f32::kabsch(&source, &target).unwrap();
assert!(result.rmsd < 1e-3);
assert!((result.scale - 1.0).abs() < 1e-3);

// Umeyama with known 2x scaling
let scaled: Vec<f32> = source.iter().map(|v| v * 2.0).collect();
let result = f32::umeyama(&source, &scaled).unwrap();
assert!((result.scale - 2.0).abs() < 1e-3);
assert!(result.rmsd < 1e-3);
```
## Parallelism and Fork Union
NumKong does not own a thread pool.
The parallel feature adds host-side orchestration helpers via Fork Union, not a hidden scheduler.
```rust
use numkong::{PackedMatrix, Tensor};
use fork_union::ThreadPool; // crate path illustrative

let a = Tensor::<f32>::try_full(&[1024, 64], 1.0).unwrap();
let b = Tensor::<f32>::try_full(&[512, 64], 1.0).unwrap();
let mut pool = ThreadPool::try_spawn(8).unwrap();

// GEMM-like: rows of A partitioned across threads, one shared packed B
let b_packed = PackedMatrix::try_pack(&b.view()).unwrap();
let c = a.view().dots_packed_parallel(&b_packed, &mut pool).unwrap();
assert_eq!(c.shape(), &[1024, 512]);

// SYRK-like: row windows of one square output partitioned across threads
let gram = a.view().dots_symmetric_parallel(&mut pool).unwrap();
assert_eq!(gram.shape(), &[1024, 1024]);
```
Rayon or a manual thread pool can still work if the rest of your application already depends on them.