# NumKong for Rust
NumKong's Rust crate keeps most of the native kernel surface while expressing it in Rust-native terms.
Rust is a natural fit for NumKong when you want static typing, explicit ownership, and strong container APIs without giving up mixed precision.
Traits cover scalar metric families.
Tensor, Vector, and packed matrix types cover the higher-level workflows.
Custom allocators, low-precision storage wrappers, and explicit row-contiguity checks stay visible instead of being hidden behind a dynamic runtime.
The crate makes the storage policy and result promotion visible.
That matters for fp16, bf16, fp8, packed bits, and strided reductions.
## Quickstart

A minimal end-to-end call (the crate path and values here are illustrative; the call shape follows the trait API documented below):

```rust
use numkong::Dot;

let a: Vec<f32> = vec![1.0, 2.0, 3.0];
let b: Vec<f32> = vec![4.0, 5.0, 6.0];
let dot = f32::dot(&a, &b).unwrap();
assert!((dot - 32.0).abs() < 1e-6);
```
## Highlights

This is the most fully featured high-level SDK after Python. It is a good fit if you want most of the native breadth without dropping into a manual FFI layer.

- **Trait-first scalar API.** `Type::operation(&a, &b)` stays compact and predictable.
- **Allocator-aware tensors.** `Tensor`, `PackedMatrix`, and `MaxSimPackedMatrix` can use custom allocators.
- **Storage-first low precision.** `f16`, `bf16`, fp8, fp6, and packed integer wrappers are first-class types.
- **Matrix kernels with explicit contracts.** Packed and symmetric kernels validate shapes and row contiguity.
- **No hidden thread pool.** Parallel helpers remain host-controlled.
- **Fork Union support.** The `parallel` feature is the intended native orchestration layer.
## Ecosystem Comparison
| Feature | NumKong | nalgebra | ndarray |
|---|---|---|---|
| Operation families | dots, distances, binary, probability, geospatial, curved, mesh, sparse, MaxSim, elementwise, reductions, cast, trig | linear algebra, decompositions | general n-dimensional arithmetic |
| Precision | BFloat16 through sub-byte; automatic widening; Kahan summation; 0 ULP in Float32/Float64 | Float32/Float64 only; no widening; standard accuracy | Float32/Float64 only; no widening; standard accuracy |
| Runtime SIMD dispatch | auto-selects best ISA per-thread at runtime across x86, ARM, RISC-V | none | none |
| Packed matrix, GEMM-like | pack once, reuse across query batches | standard matmul; no persistent packing | dot for matmul; no persistent packing |
| Symmetric kernels, SYRK-like | skip duplicate pairs, up to 2x speedup for self-distance | no duplicate-pair skipping | no duplicate-pair skipping |
| Memory model | Caller-owned; Tensor/PackedMatrix support custom allocators | Heap-allocated matrices; custom storage trait | Heap-allocated; no custom allocator support |
| Host-side parallelism | row-range partitioning via reusable ThreadPool; no hidden threads | Rayon-based parallelism possible | Rayon-based parallelism possible |
NumKong validates f16 and bf16 interop against the half crate in its own test suite.
That lets you move between ecosystem-standard half types and NumKong's kernel-facing wrappers without ambiguity.
## Installation

Minimal (the crate name below follows the project name):

```toml
[dependencies]
numkong = "7"
```

With host-side parallel helpers:

```toml
[dependencies]
numkong = { version = "7", features = ["parallel", "std"] }
```
## Compilation and Backend Selection
The crate uses the cc build system to compile the C backend with NK_DYNAMIC_DISPATCH=1 automatically.
All supported backends for the target architecture are compiled into a single binary and selected at runtime.
The two Cargo features are std, which enables standard library support, and parallel, which adds host-side orchestration via ForkUnion and implies std.
Backend selection follows the target architecture:

- ARM gets NEON, SVE, and SME, with SME available on Linux, FreeBSD, and macOS.
- x86-64 gets Haswell (AVX2), the Skylake/Icelake/Sapphire Rapids AVX-512 variants, and AMX on Linux and Windows only.
- RISC-V gets RVV backends on Linux and FreeBSD.
- WASM gets relaxed v128.
Individual backends can be disabled through environment variables. Any `NK_TARGET_*` variable set to `0` or `false` disables that backend; backends not explicitly disabled are enabled by default for the target platform.

```sh
NK_TARGET_NEON=0
NK_TARGET_SVE=0 NK_TARGET_SME=0
```
If a backend fails to compile, the build system automatically disables it and retries with the remaining backends. A warning is emitted for each disabled backend.
## Dynamic Dispatch and Capabilities
configure_thread configures rounding behavior and enables CPU-specific acceleration features such as Intel AMX.
It must be called once per thread before using any SIMD-accelerated operations.
```rust
use numkong::{available, configure_thread, SAPPHIREAMX};

let caps = available();
configure_thread(caps); // argument shape reconstructed from context
if caps & SAPPHIREAMX != 0 {
    // Sapphire Rapids AMX kernels are available on this thread
}
```
Call configure_thread at the start of every thread that will invoke NumKong kernels.
In a thread-pool setting, each worker thread needs its own call.
The function is idempotent and cheap to call more than once on the same thread.
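The per-thread rule maps directly onto `std::thread`. The sketch below stubs `configure_thread` locally so it stands alone; in real code each worker would call the crate's function instead:

```rust
use std::thread;

// Local stub standing in for NumKong's `configure_thread`, so this sketch
// compiles on its own without the crate.
fn configure_thread() { /* per-thread rounding and ISA setup happens here */ }

fn main() {
    let handles: Vec<_> = (0..4)
        .map(|worker: usize| {
            thread::spawn(move || {
                configure_thread(); // once per worker, before any kernels
                worker * worker // stand-in for kernel work
            })
        })
        .collect();
    let total: usize = handles.into_iter().map(|h| h.join().unwrap()).sum();
    println!("{total}");
}
```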
## Core Traits

The crate root re-exports the main metric families:

`Dot`, `VDot`, `Angular`, `Euclidean`, `Hamming`, `Jaccard`, `KullbackLeibler`, `JensenShannon`, `Haversine`, `Vincenty`, `Bilinear`, `Mahalanobis`, `ReduceMoments`, `ReduceMinMax`, `EachScale`, `EachSum`, `EachBlend`, `EachFMA`
The standard call shape is `Type::operation(&a, &b)`; the values below are illustrative:

```rust
use numkong::{Dot, Jaccard, JensenShannon};

let a = vec![1.0_f32, 2.0, 3.0];
let b = vec![4.0_f32, 5.0, 6.0];
let dot = f32::dot(&a, &b).unwrap();

// Packed-bit inputs: each u8 carries 8 boolean dimensions
let bits_a = vec![0b1100_1010_u8, 0b0011_0101];
let bits_b = vec![0b1100_0110_u8, 0b0011_1001];
let jaccard = u8::jaccard(&bits_a, &bits_b).unwrap();

let p = vec![0.25_f32, 0.25, 0.25, 0.25];
let q = vec![0.40_f32, 0.30, 0.20, 0.10];
let jsd = f32::jensenshannon(&p, &q).unwrap();
println!("{dot} {jaccard} {jsd}");
```
## Dot Products

Dot products span real, complex, quantized, and packed storage types.

```rust
use numkong::{Dot, VDot};

// Complex vectors as interleaved (re, im) pairs; layout illustrative
let a = vec![1.0_f32, 2.0, 3.0, 4.0];
let b = vec![5.0_f32, 6.0, 7.0, 8.0];
let dot = f32::dot(&a, &b).unwrap();
let vdot = f32::vdot(&a, &b).unwrap(); // like numpy.vdot, conjugated
println!("{dot} {vdot}");
```
## Dense Distances

The dense spatial family covers sqeuclidean, euclidean, and angular. The main value over naive loops is the combination of SIMD and a safer accumulation policy.

```rust
use numkong::Euclidean;

let a = vec![1_i8, 2, 3];
let b = vec![4_i8, 5, 6];
let distance = i8::euclidean(&a, &b).unwrap(); // widened output, not int8
println!("{distance}");
```
## Scalar Types and Promotions
The scalar wrappers are storage-first types.
They are not decorative aliases over f32.
| Type | Layout | Bytes | Range | Inf | NaN |
|---|---|---|---|---|---|
| f16 | 1+5+10 | 2 | ±65504 | yes | yes |
| bf16 | 1+8+7 | 2 | ±3.4×10³⁸ | yes | yes |
| e4m3 | 1+4+3 | 1 | ±448 | no | yes |
| e5m2 | 1+5+2 | 1 | ±57344 | yes | yes |
| e2m3 | 1+2+3 (6 bit) | 1 | ±7.5 | no | no |
| e3m2 | 1+3+2 (6 bit) | 1 | ±28 | no | no |
| u1x8 | 8 packed bits | 1 | 0–1 per bit | — | — |
| u4x2 | 2×4-bit uint | 1 | 0–15 per nib | — | — |
| i4x2 | 2×4-bit int | 1 | −8–7 per nib | — | — |
The trait hierarchy documents intent:
- `StorageElement` — raw storable element type.
- `NumberLike` — adds numeric conversion and ordering.
- `FloatConvertible` — adds unpacking and float-domain conversion.
The output type is intentionally wider than the storage type for many operations.
For example, i8::dot returns i32.
f32::dot returns a wider accumulator type.
Moments reductions widen even more aggressively.
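The policy is easy to demonstrate without the crate. In plain Rust, accumulating `i8` products in an `i32` shows why the widened return type is necessary:

```rust
// Why i8 dot products return i32: a single product of two i8 values can
// reach 127 * 127 = 16129, far outside i8's ±127 range.
fn main() {
    let a: [i8; 4] = [100; 4];
    let b: [i8; 4] = [100; 4];
    // Widened accumulation, mirroring the crate's promotion policy.
    let dot: i32 = a.iter().zip(&b).map(|(&x, &y)| x as i32 * y as i32).sum();
    assert_eq!(dot, 40_000); // would have overflowed i8 on the first product
    println!("{dot}");
}
```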
## Set Similarity
Packed-binary metrics work on packed words instead of boolean slices. That is the right model once the workload is "semantic hash" rather than "array of booleans".
```rust
use numkong::{Hamming, Jaccard};

let a = vec![0b1100_1010_u8, 0b0011_0101];
let b = vec![0b1100_0110_u8, 0b0011_1001];
let hamming = u8::hamming(&a, &b).unwrap();
let jaccard = u8::jaccard(&a, &b).unwrap();
```
Integer set Jaccard works on sorted arrays of integer identifiers.
```rust
let set_a = vec![1_u32, 3, 5, 7];
let set_b = vec![3_u32, 5, 8];
let jaccard_sets = u32::jaccard(&set_a, &set_b).unwrap();
assert!((jaccard_sets - 0.4).abs() < 1e-6); // |A ∩ B| / |A ∪ B| = 2 / 5
```
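The scalar analogue of the sorted-set kernel is a classic two-pointer merge. This standalone sketch (not the crate's SIMD path) shows the arithmetic behind the Jaccard value above:

```rust
// Two-pointer merge over sorted id arrays, the scalar analogue of the
// SIMD kernel behind integer-set Jaccard.
fn intersection_size(a: &[u32], b: &[u32]) -> usize {
    let (mut i, mut j, mut n) = (0, 0, 0);
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            std::cmp::Ordering::Less => i += 1,
            std::cmp::Ordering::Greater => j += 1,
            std::cmp::Ordering::Equal => {
                n += 1;
                i += 1;
                j += 1;
            }
        }
    }
    n
}

fn main() {
    let (a, b) = ([1u32, 3, 5, 7], [3u32, 5, 8]);
    let inter = intersection_size(&a, &b);
    let union = a.len() + b.len() - inter;
    assert_eq!(inter, 2);
    println!("{}", inter as f64 / union as f64); // Jaccard = 2 / 5
}
```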
## Probability Metrics
```rust
use numkong::{JensenShannon, KullbackLeibler};

let p = vec![0.7_f32, 0.2, 0.1];
let q = vec![0.1_f32, 0.6, 0.3];
let kl_forward = f32::kullbackleibler(&p, &q).unwrap();
let kl_reverse = f32::kullbackleibler(&q, &p).unwrap();
assert!(kl_forward != kl_reverse); // KLD is asymmetric

let js_forward = f32::jensenshannon(&p, &q).unwrap();
let js_reverse = f32::jensenshannon(&q, &p).unwrap();
assert!((js_forward - js_reverse).abs() < 1e-6); // JSD is symmetric
```
## Geospatial Metrics
Inputs are latitudes and longitudes in radians. Outputs are meters.
```rust
use numkong::{Haversine, Vincenty};

// Statue of Liberty (40.6892°N, 74.0445°W) → Big Ben (51.5007°N, 0.1246°W)
let (liberty_lat, liberty_lon) = (40.6892_f64.to_radians(), (-74.0445_f64).to_radians());
let (big_ben_lat, big_ben_lon) = (51.5007_f64.to_radians(), (-0.1246_f64).to_radians());
let mut distance = [0.0_f64];
// Batch signature reconstructed: coordinate slices in, distances out
f64::vincenty(&[liberty_lat], &[liberty_lon], &[big_ben_lat], &[big_ben_lon], &mut distance).unwrap(); // ≈ 5,589,857 m (ellipsoidal, baseline)
f64::haversine(&[liberty_lat], &[liberty_lon], &[big_ben_lat], &[big_ben_lon], &mut distance).unwrap(); // ≈ 5,543,723 m (spherical, ~46 km less)

// Vincenty in f32 — drifts ~2 m from f64
let (liberty_lat32, liberty_lon32) = (liberty_lat as f32, liberty_lon as f32);
let (big_ben_lat32, big_ben_lon32) = (big_ben_lat as f32, big_ben_lon as f32);
let mut distance_f32 = [0.0_f32];
f32::vincenty(&[liberty_lat32], &[liberty_lon32], &[big_ben_lat32], &[big_ben_lon32], &mut distance_f32).unwrap(); // ≈ 5,589,859 m (+2 m drift)
```
## Curved Metrics
Curved-space kernels combine vectors with an extra metric tensor or covariance inverse.
```rust
use numkong::{Bilinear, Mahalanobis};

// Complex bilinear form: aᴴ M b (interleaved re/im pairs; values illustrative)
let a = vec![1.0_f32, 0.0, 2.0, 0.0];
let b = vec![3.0_f32, 0.0, 4.0, 0.0];
let metric = vec![1.0_f32, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]; // complex 2×2 identity
let bilinear = f32::bilinear(&a, &b, &metric).unwrap();

// Real Mahalanobis distance: √((a−b)ᵀ M⁻¹ (a−b))
let x = vec![1.0_f32; 32];
let y = vec![0.0_f32; 32];
let mut inv_cov = vec![0.0_f32; 32 * 32];
for i in 0..32 {
    inv_cov[i * 32 + i] = 1.0; // identity matrix
}
let distance = f32::mahalanobis(&x, &y, &inv_cov).unwrap();
println!("{bilinear} {distance}");
```
## Vectors, Tensors, Views, and Spans
The container model is unusual enough that it needs direct documentation.
- `Vector<T>` owns one-dimensional storage.
- `VectorView<'a, T>` is an immutable borrowed view.
- `VectorSpan<'a, T>` is a mutable borrowed view.
- `Tensor<T, A, MAX_RANK>` owns N-dimensional storage and can use a custom allocator.
- `TensorView` and `TensorSpan` are the borrowed forms.
- `Matrix<T>` is a rank-2 alias over `Tensor<T, _, 2>`.
The allocator story is explicit.
Tensor and PackedMatrix default to Global.
The underlying layout uses SIMD_ALIGNMENT == 64 for owned allocations.
That does not mean callers must align their source buffers manually.
It means owned outputs and packed payloads are allocated in a SIMD-friendly way when the crate owns them.
```rust
use numkong::{RangeStep, SliceRange, Tensor};

let data: Vec<f32> = (0..24).map(|i| i as f32).collect();
let t = Tensor::<f32>::try_from_slice(&data, &[4, 6]).unwrap(); // shape argument reconstructed
let col = t.slice((.., 1_usize)).unwrap();                  // t[:, 1] — column 1
let rows = t.slice((0..2_usize, ..)).unwrap();              // t[0:2, :] — first two rows
let tail = t.slice((-2_isize.., ..)).unwrap();              // t[-2:, :] — last two rows
let neg = t.slice((.., -2..-1_isize)).unwrap();             // t[:, -2:-1]
let step = t.slice((.., RangeStep::new(0, 6, 2))).unwrap(); // t[:, ::2]

// Explicit &[SliceRange] syntax also works (constructor names reconstructed)
let col = t.slice(&[SliceRange::all(), SliceRange::single(1)][..]).unwrap();
```
Tuple elements implement SliceArg — each monomorphized with zero runtime dispatch:
| Rust syntax | Meaning |
|---|---|
| `..` | all |
| `0_usize` / `-1_isize` | single index (negative wraps from end) |
| `1..4_usize` / `-3..-1_isize` | half-open range |
| `..3_usize` / `..-1_isize` | from start |
| `1_usize..` / `-2_isize..` | to end |
| `0..=2_usize` / `-3..=-1_isize` | inclusive range |
| `RangeStep::new(0, 6, 2)` | stepped (no Rust literal) |
Integer literals default to i32 — use _usize / _isize suffixes.
Negative isize values wrap from the dimension end, like Python.
Iteration works at the logical-dimension level.
For sub-byte types like i4x2 (2 nibbles per byte), iterating a 3-element vector yields 6 dimensions.
Immutable iterators (iter()) yield DimRef<T>, which dereferences to T::DimScalar.
Mutable iterators (iter_mut()) yield DimMut<T>, which writes back on drop — the only way to mutate individual nibbles or bits.
```rust
use numkong::{i4x2, Tensor};

// 6 logical nibble dimensions stored in 3 bytes
let mut nibbles = Tensor::<i4x2>::try_zeros(&[6]).unwrap();
for (i, mut dim) in nibbles.iter_mut().enumerate() {
    *dim = i as i8; // DimMut writes the nibble back on drop
}
assert_eq!(nibbles.iter().count(), 6);
assert_eq!(nibbles.iter().map(|d| *d as usize).sum::<usize>(), 15); // 0+1+…+5
```
Vectors and tensors can be converted between each other without copying:
```rust
use numkong::{Tensor, Vector};

let v = Vector::<f32>::try_from_scalars(&[1.0, 2.0, 3.0, 4.0]).unwrap();
let t: Tensor<f32> = v.try_into_tensor(&[2, 2]).unwrap(); // shape argument reconstructed
assert_eq!(t.shape(), &[2, 2]);
let v2 = t.try_into_vector().unwrap();
assert_eq!(v2.len(), 4);
```
The main layout rules are:
- General slicing and transposition are supported by views.
- Elementwise and many reduction kernels accept strided views.
- Matrix-style kernels require rank-2 inputs with contiguous rows.
- A tensor can be non-contiguous overall and still have contiguous rows.
- Some reductions have SIMD kernels for strided lanes.
- Some backends still fall back depending on alignment and dtype.
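The row-contiguity contract reduces to a stride check. A standalone sketch of the rule, with strides expressed in elements (illustrative, not the crate's internal code):

```rust
// Row contiguity as a stride check: the innermost stride must be one
// element, regardless of how the outer dimensions are strided.
fn rows_contiguous(strides: &[isize]) -> bool {
    strides.last() == Some(&1)
}

fn main() {
    assert!(rows_contiguous(&[8, 1]));  // dense row-major 4×8 view
    assert!(rows_contiguous(&[16, 1])); // every other row: rows still contiguous
    assert!(!rows_contiguous(&[8, 2])); // column step of 2 breaks row contiguity
    println!("ok");
}
```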
Sub-byte types (i4x2, u4x2, u1x8) use logical shapes.
A shape of [8] for i4x2 means 8 nibbles (stored in 4 bytes), not 8 bytes.
The innermost dimension must be divisible by dimensions_per_value() (2 for nibble types, 8 for bit types).
Transpose and reshape are not supported for sub-byte types — they return SubByteUnsupported.
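The byte bookkeeping for sub-byte shapes is simple to state in plain Rust (an illustrative helper, not a crate API):

```rust
// Logical shapes for sub-byte types: storage bytes are logical dimensions
// divided by dimensions_per_value (2 for nibble types, 8 for bit types).
fn storage_bytes(logical_dims: usize, dims_per_value: usize) -> usize {
    assert!(
        logical_dims % dims_per_value == 0,
        "innermost dimension must be divisible by dims_per_value"
    );
    logical_dims / dims_per_value
}

fn main() {
    assert_eq!(storage_bytes(8, 2), 4);  // i4x2: 8 nibbles in 4 bytes
    assert_eq!(storage_bytes(64, 8), 8); // u1x8: 64 bits in 8 bytes
    println!("ok");
}
```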
## Elementwise Operations
Elementwise kernels live on tensors and views. They are not a promise that every arbitrary strided view gets the same SIMD path on every backend.
```rust
use numkong::Tensor;

let a = Tensor::<f32>::try_from_slice(&[0.0, 0.5, 1.0, 1.5], &[4]).unwrap();
let b = Tensor::<f32>::try_full(&[4], 2.0).unwrap();
// Blend weight argument reconstructed
let blended = a.view().try_blend_tensor(&b.view(), 0.5).unwrap();
let sines = blended.sin().unwrap();
assert_eq!(sines.shape(), &[4]);
```
Compound assignment operators work in-place:
```rust
use numkong::Tensor;

let mut t = Tensor::<f32>::try_full(&[4], 1.0).unwrap();
t += 10.0;
t -= 0.5;
t *= 2.0;
```
## Trigonometry
The trigonometric kernels share the tensor and view surface. They are useful both directly and as a sanity check that the container path is not just about matrix kernels.
```rust
use numkong::Tensor;

let a = Tensor::<f32>::try_from_slice(&[0.0, 0.5, 1.0, 1.5], &[4]).unwrap();
let c = a.cos().unwrap();
let s = a.sin().unwrap();
assert_eq!(c.shape(), &[4]);
assert_eq!(s.shape(), &[4]);
```
## Moments Reductions
Moments reductions return both sum and sum-of-squares. That is the right building block for norms and variance-like workflows.
```rust
use numkong::{ReduceMoments, Tensor};

let narrow = Tensor::<u8>::try_full(&[1024], 255).unwrap();
let (sum, sum_sq) = narrow.try_moments_all().unwrap(); // widened result types
assert!(sum > u8::MAX as u64); // a naive u8 accumulation would overflow immediately
assert!(sum_sq > u8::MAX as u64); // same for sum-of-squares
```
The important documentation point is not just "wider outputs exist". It is that the API makes the widened outputs part of the type story.
## Min/Max Reductions
Min/max reductions return a MinMaxResult with both the value and its flat index:
```rust
use numkong::Tensor;

let t = Tensor::<f32>::try_from_slice(&[3.0, 9.0, 1.0, 4.0, 7.0, 5.0], &[3, 2]).unwrap();
let second_column = t.slice((.., 1_usize)).unwrap(); // t[:, 1] = [9.0, 4.0, 5.0]
let idx = second_column.try_argmin_all().unwrap();
assert_eq!(idx, 1); // 4.0 sits at flat index 1 within the column
```
## Sparse Operations and Intersections
Sparse helpers cover both sorted-index intersection and weighted sparse dot products.
```rust
use numkong::{SparseDot, SparseIntersect}; // trait names reconstructed

let a_idx = vec![1_u32, 3, 5, 7];
let b_idx = vec![3_u32, 5, 8];
let count = u32::sparse_intersection_size(&a_idx, &b_idx);
assert_eq!(count, 2); // indices 3 and 5

let a_weights = vec![0.1_f32, 0.2, 0.3, 0.4];
let b_weights = vec![0.5_f32, 0.6, 0.7];
let dot = u32::sparse_dot(&a_idx, &a_weights, &b_idx, &b_weights).unwrap();
assert!(dot > 0.0); // weighted dot over the shared indices 3 and 5
```
## Packed Matrix Kernels for GEMM-Like Workloads
Packed kernels are the main "matrix throughput" path in the crate.
They are GEMM-like in workload shape.
They are not a thin BLAS clone.
```rust
use numkong::{PackedMatrix, Tensor};

let a = Tensor::<f32>::try_full(&[8, 64], 1.0).unwrap();  // 8 queries × 64 dims
let b = Tensor::<f32>::try_full(&[16, 64], 1.0).unwrap(); // 16 targets × 64 dims
let b_packed = PackedMatrix::try_pack(&b.view()).unwrap();
let c = a.dots_packed(&b_packed);
assert_eq!(c.shape(), &[8, 16]);
```
The useful economics are:

- pack `B` once, then reuse it across many `A` batches
- convert or pad once during packing instead of on every multiply
- reuse precomputed norms for `angulars_packed` and `euclideans_packed`
The crate checks row contiguity because these kernels assume contiguous rows. Caller-side source alignment is not required. The owned packed buffer handles its own aligned allocation internally.
## Symmetric Kernels for SYRK-Like Workloads
Symmetric kernels are for self-similarity and self-distance.
They are SYRK-like in shape.
They avoid duplicate (i, j) and (j, i) work.
```rust
use numkong::Tensor;

let vectors = Tensor::<f32>::try_full(&[16, 64], 1.0).unwrap();
let gram = vectors.view().try_dots_symmetric().unwrap();
assert_eq!(gram.shape(), &[16, 16]);
```
This family is also where row-window partitioning becomes the natural parallel model.
That is structurally different from packed GEMM-style work against a shared packed RHS.
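Row-window partitioning itself is plain index arithmetic. A standalone sketch of near-equal windows (not the ForkUnion implementation):

```rust
// Near-equal row windows over n rows and t workers: the host-side model
// the symmetric kernels parallelize over.
fn row_windows(n: usize, t: usize) -> Vec<(usize, usize)> {
    (0..t).map(|k| (k * n / t, (k + 1) * n / t)).collect()
}

fn main() {
    let windows = row_windows(10, 3);
    assert_eq!(windows, vec![(0, 3), (3, 6), (6, 10)]);
    // Windows tile the full range with no gaps or overlap.
    assert_eq!(windows.iter().map(|w| w.1 - w.0).sum::<usize>(), 10);
    println!("ok");
}
```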
## MaxSim and ColBERT-Style Late Interaction
MaxSim is the late-interaction primitive used by systems such as ColBERT. It is not "just another matrix multiply".
```rust
use numkong::Tensor;

let queries = Tensor::<f32>::try_full(&[8, 128], 0.1).unwrap(); // 8 query tokens
let docs = Tensor::<f32>::try_full(&[32, 128], 0.1).unwrap();   // 32 document tokens
let queries_packed = queries.view().try_maxsim_pack().unwrap();
let docs_packed = docs.view().try_maxsim_pack().unwrap();
let score = queries_packed.score(&docs_packed);
assert!(score > 0.0);
```
## Geometric Mesh Alignment
Mesh alignment returns transforms, scales, and RMSD values. That is a different API shape from the scalar metric families.
```rust
use numkong::MeshAlignment;

// Two 4-point meshes as flat (x, y, z) triplets; values illustrative,
// result field names reconstructed
let source = vec![0.0_f32, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0];
let target = source.clone();
let result = f32::kabsch(&source, &target).unwrap();
assert!(result.rmsd >= 0.0);
assert!(result.rmsd < 1e-5); // identical meshes align with zero error

// Umeyama with known 2x scaling
let scaled: Vec<f32> = source.iter().map(|x| x * 2.0).collect();
let result = f32::umeyama(&source, &scaled).unwrap();
assert!((result.scale - 2.0).abs() < 1e-4); // recovered scale factor
assert!(result.rmsd < 1e-5);
```
## Tolerance Comparison
Exact floating-point equality is rarely what you want after arithmetic.
allclose() checks every element pair with the formula:
$$ |a - b| \leq \text{atol} + \text{rtol} \cdot |b| $$
Available on Vector, VectorView, VectorSpan, Tensor, TensorView, and TensorSpan.
For tensors, allclose is provided by the AllCloseOps trait — import it if calling on a TensorRef implementor.
Shape mismatch returns false.
The scalar helper is_close is re-exported at crate root.
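The element test can be written out in a few lines of plain Rust (a standalone sketch of the formula above, not the crate's `is_close`):

```rust
// The tolerance test behind allclose: |a - b| <= atol + rtol * |b|.
fn is_close(a: f64, b: f64, rtol: f64, atol: f64) -> bool {
    (a - b).abs() <= atol + rtol * b.abs()
}

fn main() {
    assert!(is_close(1.0, 1.0 + 1e-9, 1e-5, 1e-8)); // within tolerance
    assert!(!is_close(1.0, 1.1, 1e-5, 1e-8));       // far outside
    println!("ok");
}
```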
```rust
use numkong::{is_close, AllCloseOps, Tensor, Vector};

// Scalar check (rtol/atol argument order reconstructed)
assert!(is_close(1.0_f32, 1.0 + 1e-7, 1e-5, 1e-8));

// Vector tolerance check
let a = Vector::<f32>::try_full(1024, 1.0).unwrap();
let b = Vector::<f32>::try_full(1024, 1.0 + 1e-7).unwrap();
assert!(a.allclose(&b, 1e-5, 1e-8));

// Tensor tolerance check
let ta = Tensor::<f32>::try_full(&[8, 8], 1.0).unwrap();
let tb = Tensor::<f32>::try_full(&[8, 8], 1.0 + 1e-7).unwrap();
assert!(ta.allclose(&tb, 1e-5, 1e-8));
```
## Type Casting
The cast function performs bulk conversion between contiguous slices.
Any pair of types that implement CastDtype (all NumberLike scalars) can be converted.
```rust
use numkong::{bf16, cast};

let src: Vec<f32> = vec![1.0, 2.5, -3.75, 100.0];
let mut dst: Vec<bf16> = vec![bf16::default(); src.len()];
cast(&src, &mut dst).unwrap();
assert!(dst.len() == src.len());
```
Tensor, TensorView, and TensorSpan expose casting via the CastOps trait.
try_cast_dtype() allocates a new tensor; try_cast_dtype_into() writes into a pre-allocated TensorSpan.
Strided and non-contiguous views are supported: the implementation scans strides from the innermost dimension outward to find the longest contiguous tail, then walks the outer dimensions and casts each contiguous block in a single kernel call.
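The contiguous-tail scan can be sketched in plain Rust, with strides in elements (illustrative, not the crate's internal code):

```rust
// Innermost-first scan for the longest contiguous tail: grow the run while
// each dimension's stride equals the element count of the tail so far.
fn contiguous_tail(shape: &[usize], strides: &[isize]) -> usize {
    let mut run = 1usize;
    for (&n, &s) in shape.iter().rev().zip(strides.iter().rev()) {
        if s == run as isize {
            run *= n;
        } else {
            break;
        }
    }
    run
}

fn main() {
    // 2×3×4 row-major: fully contiguous, casts as one 24-element block.
    assert_eq!(contiguous_tail(&[2, 3, 4], &[12, 4, 1]), 24);
    // Strided outermost dimension: casts proceed in 12-element blocks.
    assert_eq!(contiguous_tail(&[2, 3, 4], &[24, 4, 1]), 12);
    println!("ok");
}
```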
```rust
use numkong::{bf16, CastOps, Tensor};

let src = Tensor::<f32>::try_full(&[4, 8], 1.5).unwrap();
let mut dst = Tensor::<bf16>::try_zeros(&[4, 8]).unwrap();
src.view().try_cast_dtype_into(&mut dst.span()).unwrap(); // span accessor name reconstructed
```
## Parallelism and ForkUnion
NumKong does not own a thread pool.
The parallel feature adds host-side orchestration helpers via ForkUnion, not a hidden scheduler.
```rust
use numkong::{PackedMatrix, Tensor};
use fork_union::ThreadPool; // module path reconstructed

let a = Tensor::<f32>::try_full(&[256, 64], 1.0).unwrap();
let b = Tensor::<f32>::try_full(&[512, 64], 1.0).unwrap();
let mut pool = ThreadPool::try_spawn(8).unwrap();

// GEMM-like: rows of A partitioned across threads, one shared packed B
let b_packed = PackedMatrix::try_pack(&b.view()).unwrap();
let c = a.dots_packed_parallel(&b_packed, &mut pool);
assert_eq!(c.shape(), &[256, 512]);

// SYRK-like: row windows of one square output partitioned across threads
let gram = a.dots_symmetric_parallel(&mut pool);
assert_eq!(gram.shape(), &[256, 256]);
```
Rayon or a manual thread pool can still work if the rest of your application already depends on them.
## Addressing External Memory
Views wrap raw pointers without ownership, owned containers accept custom allocators, and the scalar trait API works on any &[T] regardless of how the memory was allocated.
VectorView::from_raw_parts and TensorView::from_raw_parts wrap device-accessible or externally allocated memory.
The mutable counterparts VectorSpan::from_raw_parts and TensorSpan::from_raw_parts work the same way with *mut T.
```rust
use numkong::{TensorView, VectorView};

let embeddings_ptr: *const f32 = /* from CUDA, mmap, or FFI */;
let embeddings = unsafe { VectorView::from_raw_parts(embeddings_ptr, 1024) };

let shape = [256_usize, 1024];
let strides = [1024_isize, 1]; // row-major f32; stride units reconstructed as elements
let matrix = unsafe { TensorView::from_raw_parts(embeddings_ptr, &shape, &strides) };
```
Owned containers accept any allocator. A CUDA unified memory allocator looks like this:
use ;
use NonNull;
use Vector;
;
unsafe
let queries = try_zeros_in.unwrap;
The trait-based scalar API works on any &[T] — Vec, mmap, arena, or pinned buffer:
```rust
use numkong::Dot;

let weights: &[f32] = /* any contiguous slice */;
let query: &[f32] = /* any contiguous slice */;
let similarity = f32::dot(weights, query).unwrap();
```