NumKong for Rust

NumKong's Rust crate keeps most of the native kernel surface while expressing it in Rust-native terms. Rust is the cleanest place to use NumKong when you want static typing, explicit ownership, and strong container APIs without giving up mixed precision. Traits cover scalar metric families. Tensor, Vector, and packed matrix types cover the higher-level workflows. Custom allocators, low-precision storage wrappers, and explicit row-contiguity checks stay visible instead of being hidden behind a dynamic runtime. The crate makes the storage policy and result promotion visible. That matters for fp16, bf16, fp8, packed bits, and strided reductions.

Quickstart

use numkong::{configure_thread, Dot};

fn main() {
    configure_thread();
    let a = [1.0_f32, 2.0, 3.0];
    let b = [4.0_f32, 5.0, 6.0];
    let dot = f32::dot(&a, &b).unwrap();
    println!("dot={dot}");
}

Highlights

This is the most complete high-level SDK after Python. It is the best fit if you want most of the native breadth without dropping into a manual FFI layer.

Trait-first scalar API. Type::operation(&a, &b) stays compact and predictable. Allocator-aware tensors. Tensor, PackedMatrix, and MaxSimPackedMatrix can use custom allocators. Storage-first low precision. f16, bf16, fp8, fp6, and packed integer wrappers are first-class types. Matrix kernels with explicit contracts. Packed and symmetric kernels validate shapes and row contiguity. No hidden thread pool. Parallel helpers remain host-controlled. Fork Union support. The parallel feature is the intended native orchestration layer.

Ecosystem Comparison

Feature	NumKong	nalgebra	ndarray
Operation families	dots, distances, binary, probability, geospatial, curved, mesh, sparse, MaxSim, elementwise, reductions, cast, trig	linear algebra, decompositions	general n-dimensional arithmetic
Precision	BFloat16 through sub-byte; automatic widening; Kahan summation; 0 ULP in Float32/Float64	Float32/Float64 only; no widening; standard accuracy	Float32/Float64 only; no widening; standard accuracy
Runtime SIMD dispatch	auto-selects best ISA per-thread at runtime across x86, ARM, RISC-V	none	none
Packed matrix, GEMM-like	pack once, reuse across query batches	standard matmul; no persistent packing	`dot` for matmul; no persistent packing
Symmetric kernels, SYRK-like	skip duplicate pairs, up to 2x speedup for self-distance	no duplicate-pair skipping	no duplicate-pair skipping
Memory model	Caller-owned; `Tensor`/`PackedMatrix` support custom allocators	Heap-allocated matrices; custom storage trait	Heap-allocated; no custom allocator support
Host-side parallelism	row-range partitioning via reusable `ThreadPool`; no hidden threads	Rayon-based parallelism possible	Rayon-based parallelism possible

NumKong validates f16 and bf16 interop against the half crate in its own test suite. That lets you move between ecosystem-standard half types and NumKong's kernel-facing wrappers without ambiguity.

Installation

Minimal:

[dependencies]
numkong = "6.5.15"

With host-side parallel helpers:

[dependencies]
numkong = { version = "6.5.15", features = ["parallel", "std"] }

Compilation and Backend Selection

The crate uses the cc build system to compile the C backend with NK_DYNAMIC_DISPATCH=1 automatically. All supported backends for the target architecture are compiled into a single binary and selected at runtime.

The two Cargo features are std, which enables standard library support, and parallel, which adds host-side orchestration via Fork Union and implies std.

Backend selection follows the target architecture. ARM gets NEON, SVE, and SME, with SME available on Linux, FreeBSD, and macOS. x86-64 gets Haswell (AVX2), Skylake/Icelake/Sapphire Rapids AVX-512 variants, and AMX on Linux and Windows only. RISC-V gets RVV backends on Linux and FreeBSD. WASM gets relaxed v128.

Individual backends can be disabled through environment variables. Any NK_TARGET_* variable set to 0 or false disables that backend. Backends not explicitly disabled are enabled by default for the target platform.

NK_TARGET_NEON=0 cargo build
NK_TARGET_SVE=0 NK_TARGET_SME=0 cargo build

If a backend fails to compile, the build system automatically disables it and retries with the remaining backends. A warning is emitted for each disabled backend.

Dynamic Dispatch and Capabilities

configure_thread configures rounding behavior and enables CPU-specific acceleration features such as Intel AMX. It must be called once per thread before using any SIMD-accelerated operations.

use numkong::{available, configure_thread, cap};

let caps = available();
configure_thread();

if caps & cap::SAPPHIREAMX != 0 {
    println!("AMX available");
}

Call configure_thread at the start of every thread that will invoke NumKong kernels. In a thread-pool setting, each worker thread needs its own call. The function is idempotent and cheap to call more than once on the same thread.

Core Traits

The crate root re-exports the main metric families:

Dot, VDot, Angular, Euclidean
Hamming, Jaccard
KullbackLeibler, JensenShannon
Haversine, Vincenty
Bilinear, Mahalanobis
ReduceMoments, ReduceMinMax
EachScale, EachSum, EachBlend, EachFMA

The standard call shape is:

use numkong::{Dot, JensenShannon, Jaccard, u1x8};

let a = [1.0_f32, 2.0, 3.0];
let b = [4.0_f32, 5.0, 6.0];
let dot = f32::dot(&a, &b).unwrap();

let bits_a = [u1x8(0b11110000), u1x8(0b00001111)];
let bits_b = [u1x8(0b11110000), u1x8(0b11110000)];
let jaccard = u1x8::jaccard(&bits_a, &bits_b).unwrap();

let p = [0.2_f32, 0.3, 0.5];
let q = [0.1_f32, 0.3, 0.6];
let jsd = f32::jensenshannon(&p, &q).unwrap();

println!("{dot} {jaccard} {jsd}");

Dot Products

Dot products span real, complex, quantized, and packed storage types.

use numkong::{Dot, VDot, f32c};

let a = [f32c { re: 1.0, im: 2.0 }, f32c { re: 3.0, im: 4.0 }];
let b = [f32c { re: 5.0, im: 6.0 }, f32c { re: 7.0, im: 8.0 }];

let dot = f32c::dot(&a, &b).unwrap();
let vdot = f32c::vdot(&a, &b).unwrap(); // like numpy.vdot, conjugated

println!("{dot:?} {vdot:?}");

Dense Distances

The dense spatial family covers sqeuclidean, euclidean, and angular. The main value over naive loops is the combination of SIMD and safer accumulation policy.

use numkong::Euclidean;

let a = [1_i8, 2, 3, 4];
let b = [4_i8, 3, 2, 1];

let distance = i8::euclidean(&a, &b).unwrap(); // widened output, not int8
println!("{distance}");

Scalar Types and Promotions

The scalar wrappers are storage-first types. They are not decorative aliases over f32.

Type	Layout	Bytes	Range	Inf	NaN
`f16`	1+5+10	2	±65504	yes	yes
`bf16`	1+8+7	2	±3.4×10³⁸	yes	yes
`e4m3`	1+4+3	1	±448	no	yes
`e5m2`	1+5+2	1	±57344	yes	yes
`e2m3`	1+2+3 (6 bit)	1	±7.5	no	no
`e3m2`	1+3+2 (6 bit)	1	±28	no	no
`u1x8`	8 packed bits	1	0–1 per bit	—	—
`u4x2`	2×4-bit uint	1	0–15 per nib	—	—
`i4x2`	2×4-bit int	1	−8–7 per nib	—	—

The trait hierarchy documents intent:

StorageElement — raw storable element type.
NumberLike — adds numeric conversion and ordering.
FloatConvertible — adds unpacking and float-domain conversion.

The output type is intentionally wider than the storage type for many operations. For example, i8::dot returns i32. f32::dot returns a wider accumulator type. Moments reductions widen even more aggressively.

Set Similarity

Packed-binary metrics work on packed words instead of boolean slices. That is the right model once the workload is "semantic hash" rather than "array of booleans".

use numkong::{Hamming, Jaccard, u1x8};

let a = [u1x8(0b10101010), u1x8(0b11110000)];
let b = [u1x8(0b10101110), u1x8(0b11000000)];
let hamming = u1x8::hamming(&a, &b).unwrap();
let jaccard = u1x8::jaccard(&a, &b).unwrap();

Integer set Jaccard works on sorted arrays of integer identifiers.

let set_a = [1_u32, 3, 5, 7, 9];
let set_b = [3_u32, 5, 8, 9, 10];
let jaccard_sets = u32::jaccard(&set_a, &set_b).unwrap();
assert!(jaccard_sets > 0.0 && jaccard_sets < 1.0); // |A ∩ B| / |A ∪ B|

Probability Metrics

use numkong::{JensenShannon, KullbackLeibler};

let p = [0.2_f32, 0.3, 0.5], q = [0.1_f32, 0.3, 0.6];
let kl_forward = f32::kullbackleibler(&p, &q).unwrap();
let kl_reverse = f32::kullbackleibler(&q, &p).unwrap();
assert!(kl_forward != kl_reverse); // KLD is asymmetric

let js_forward = f32::jensenshannon(&p, &q).unwrap();
let js_reverse = f32::jensenshannon(&q, &p).unwrap();
assert!((js_forward - js_reverse).abs() < 1e-6, "JSD is symmetric");

Geospatial Metrics

Inputs are latitudes and longitudes in radians. Outputs are meters.

use numkong::{Haversine, Vincenty};

// Statue of Liberty (40.6892°N, 74.0445°W) → Big Ben (51.5007°N, 0.1246°W)
let liberty_lat = [0.7101605100_f64], liberty_lon = [-1.2923203180_f64];
let big_ben_lat = [0.8988567821_f64], big_ben_lon = [-0.0021746802_f64];
let mut distance = [0.0_f64; 1];
f64::vincenty(&liberty_lat, &liberty_lon, &big_ben_lat, &big_ben_lon, &mut distance).unwrap();  // ≈ 5,589,857 m (ellipsoidal, baseline)
f64::haversine(&liberty_lat, &liberty_lon, &big_ben_lat, &big_ben_lon, &mut distance).unwrap(); // ≈ 5,543,723 m (spherical, ~46 km less)

// Vincenty in f32 — drifts ~2 m from f64
let liberty_lat32 = [0.7101605100_f32], liberty_lon32 = [-1.2923203180_f32];
let big_ben_lat32 = [0.8988567821_f32], big_ben_lon32 = [-0.0021746802_f32];
let mut distance_f32 = [0.0_f32; 1];
f32::vincenty(&liberty_lat32, &liberty_lon32, &big_ben_lat32, &big_ben_lon32, &mut distance_f32).unwrap(); // ≈ 5,589,859 m (+2 m drift)

Curved Metrics

Curved-space kernels combine vectors with an extra metric tensor or covariance inverse.

use numkong::{Bilinear, Mahalanobis, f32c};

// Complex bilinear form: aᴴ M b
let a = [f32c { re: 1.0, im: 0.0 }; 16];
let b = [f32c { re: 0.0, im: 1.0 }; 16];
let metric = [f32c { re: 1.0, im: 0.0 }; 16 * 16];
let bilinear = f32c::bilinear(&a, &b, &metric).unwrap();

// Real Mahalanobis distance: √((a−b)ᵀ M⁻¹ (a−b))
let x = [1.0_f32; 32];
let y = [2.0_f32; 32];
let mut inv_cov = vec![0.0_f32; 32 * 32];
for i in 0..32 { inv_cov[i * 32 + i] = 1.0; } // identity matrix
let distance = f32::mahalanobis(&x, &y, &inv_cov).unwrap();

Vectors, Tensors, Views, and Spans

The container model is unusual enough that it needs direct documentation.

Vector<T> owns one-dimensional storage.
VectorView<'a, T> is an immutable borrowed view.
VectorSpan<'a, T> is a mutable borrowed view.
Tensor<T, A, MAX_RANK> owns N-dimensional storage and can use a custom allocator.
TensorView and TensorSpan are the borrowed forms.
Matrix<T> is a rank-2 alias over Tensor<T, _, 2>.

The allocator story is explicit. Tensor and PackedMatrix default to Global. The underlying layout uses SIMD_ALIGNMENT == 64 for owned allocations. That does not mean callers must align their source buffers manually. It means owned outputs and packed payloads are allocated in a SIMD-friendly way when the crate owns them.

use numkong::{SliceRange, Tensor};

let t = Tensor::<f32>::try_from_slice(&[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], &[3, 3]).unwrap();
assert_eq!(t.shape(), &[3, 3]);
assert_eq!(t.row(1), Some(&[3.0, 4.0, 5.0][..]));

// Transpose
let view = t.view().transpose();
assert_eq!(view.shape(), &[3, 3]);

// Column slicing — produces a strided view
let col1 = t.view().slice(&[SliceRange::full(), SliceRange::index(1)]).unwrap();
assert_eq!(col1.shape(), &[3]);

// Scalar element access
let val = t.view().get(&[1, 2]).unwrap(); // row 1, col 2 → 5.0

// Reduction on a sliced view
let idx = col1.try_argmin_all().unwrap(); // index of the minimum in the second column

The main layout rules are:

General slicing and transposition are supported by views.
Elementwise and many reduction kernels accept strided views.
Matrix-style kernels require rank-2 inputs with contiguous rows.
A tensor can be non-contiguous overall and still have contiguous rows.
Some reductions have SIMD kernels for strided lanes.
Some backends still fall back depending on alignment and dtype.

Elementwise Operations

Elementwise kernels live on tensors and views. They are not a promise that every arbitrary strided view gets the same SIMD path on every backend.

use numkong::Tensor;

let a = Tensor::<f32>::try_from_slice(&[1.0, 2.0, 3.0, 4.0], &[2, 2]).unwrap();
let b = Tensor::<f32>::try_full(&[2, 2], 2.0).unwrap();

let blended = a.view().try_blend_tensor(&b.view(), 0.25, 0.75).unwrap();
let sines = blended.sin().unwrap();

assert_eq!(sines.shape(), &[2, 2]);

Trigonometry

The trigonometric kernels share the tensor and view surface. They are useful both directly and as a sanity check that the container path is not just about matrix kernels.

use numkong::Tensor;

let a = Tensor::<f32>::try_from_slice(&[0.0, 1.0, 2.0, 3.0], &[2, 2]).unwrap();
let c = a.cos().unwrap();
let s = a.sin().unwrap();

assert_eq!(c.shape(), &[2, 2]);
assert_eq!(s.shape(), &[2, 2]);

Moments Reductions

Moments reductions return both sum and sum-of-squares. That is the right building block for norms and variance-like workflows.

use numkong::{ReduceMoments, Tensor};

let narrow = Tensor::<u8>::try_full(&[1024], 255).unwrap();
let (sum, sumsq) = narrow.try_moments_all().unwrap();

assert!(sum > 255);      // a naive u8 accumulation would overflow immediately
assert!(sumsq > 255u64); // same for sum-of-squares

The important documentation point is not just "wider outputs exist". It is that the API makes the widened outputs part of the type story.

Min/Max Reductions

Min/max reductions are a separate family because they stress layout differently. They also expose the unusual strength of NumKong's reduction backends: some strided lanes still hit SIMD kernels.

use numkong::Tensor;

let t = Tensor::<f32>::try_from_slice(&[
    3.0, 0.0, 7.0,
    1.0, 2.0, 5.0,
    4.0, -1.0, 6.0,
], &[3, 3]).unwrap();

let second_column = t.view().slice(&[SliceRange::full(), SliceRange::index(1)]).unwrap();
let idx = second_column.try_argmin_all().unwrap();

assert_eq!(idx, 2);

Sparse Operations and Intersections

Sparse helpers cover both sorted-index intersection and weighted sparse dot products.

use numkong::{SparseIntersect, SparseDot};

let a_idx = [1_u32, 3, 5, 7], b_idx = [3_u32, 4, 5, 8];
let count = u32::sparse_intersection_size(&a_idx, &b_idx);
assert_eq!(count, 2); // indices 3 and 5

let a_weights = [1.0_f32, 2.0, 3.0, 4.0], b_weights = [5.0_f32, 6.0, 7.0, 8.0];
let dot = u32::sparse_dot(&a_idx, &b_idx, &a_weights, &b_weights).unwrap();
assert!(dot > 0.0); // weighted dot over shared indices

Packed Matrix Kernels for GEMM-Like Workloads

Packed kernels are the main "matrix throughput" path in the crate. They are GEMM-like in workload shape. They are not a thin BLAS clone.

use numkong::{PackedMatrix, Tensor};

let a = Tensor::<f32>::try_full(&[1024, 512], 1.0).unwrap();
let b = Tensor::<f32>::try_full(&[256, 512], 1.0).unwrap();

let b_packed = PackedMatrix::try_pack(&b).unwrap();
let c = a.dots_packed(&b_packed);

assert_eq!(c.shape(), &[1024, 256]);

The useful economics are:

pack B once
reuse it across many A batches
convert or pad once during packing instead of on every multiply
reuse precomputed norms for angulars_packed and euclideans_packed

The crate checks row contiguity because these kernels assume contiguous rows. Caller-side source alignment is not required. The owned packed buffer handles its own aligned allocation internally.

Symmetric Kernels for SYRK-Like Workloads

Symmetric kernels are for self-similarity and self-distance. They are SYRK-like in shape. They avoid duplicate (i, j) and (j, i) work.

use numkong::Tensor;

let vectors = Tensor::<f32>::try_full(&[100, 768], 1.0).unwrap();
let gram = vectors.view().try_dots_symmetric().unwrap();

assert_eq!(gram.shape(), &[100, 100]);

This family is also where row-window partitioning becomes the natural parallel model. That is structurally different from packed GEMM-style work against a shared packed RHS.

MaxSim and ColBERT-Style Late Interaction

MaxSim is the late-interaction primitive used by systems such as ColBERT. It is not "just another matrix multiply".

use numkong::{MaxSimPackedMatrix, Tensor};

let queries = Tensor::<f32>::try_full(&[4, 16], 1.0).unwrap();
let docs = Tensor::<f32>::try_full(&[8, 16], 1.0).unwrap();

let queries_packed = queries.view().try_maxsim_pack().unwrap();
let docs_packed = docs.view().try_maxsim_pack().unwrap();
let score = queries_packed.score(&docs_packed);

assert!(score.is_finite());

Geometric Mesh Alignment

Mesh alignment returns transforms, scales, and RMSD values. That is a different API shape from the scalar metric families.

use numkong::MeshAlignment;

let source = [[0.0_f32, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]];
let target = [[0.0_f32, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]];

let result = f32::kabsch(&source, &target).unwrap();
assert!(result.rmsd < 1e-6);
assert!((result.scale - 1.0).abs() < 1e-6);

// Umeyama with known 2x scaling
let scaled = [[0.0_f32, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 2.0, 0.0]];
let result = f32::umeyama(&source, &scaled).unwrap();
assert!(result.rmsd < 1e-6);
assert!((result.scale - 2.0).abs() < 0.01);

Parallelism and Fork Union

NumKong does not own a thread pool. The parallel feature adds host-side orchestration helpers via Fork Union, not a hidden scheduler.

use numkong::{PackedMatrix, Tensor};
use fork_union::ThreadPool;

let a = Tensor::<f32>::try_full(&[4096, 768], 1.0).unwrap();
let b = Tensor::<f32>::try_full(&[8192, 768], 1.0).unwrap();
let mut pool = ThreadPool::try_spawn(4).unwrap();

// GEMM-like: rows of A partitioned across threads, one shared packed B
let b_packed = PackedMatrix::try_pack(&b).unwrap();
let c = a.dots_packed_parallel(&b_packed, &mut pool);
assert_eq!(c.shape(), &[4096, 8192]);

// SYRK-like: row windows of one square output partitioned across threads
let gram = a.dots_symmetric_parallel(&mut pool);
assert_eq!(gram.shape(), &[4096, 4096]);

Rayon or a manual thread pool can still work if the rest of your application already depends on them.

numkong 7.0.0