# NumKong for Rust
NumKong's Rust crate keeps most of the native kernel surface while expressing it in Rust-native terms.
Rust is a natural fit for NumKong when you want static typing, explicit ownership, and strong container APIs without giving up mixed precision.
Traits cover scalar metric families.
Tensor, Vector, and packed matrix types cover the higher-level workflows.
Custom allocators, low-precision storage wrappers, and explicit row-contiguity checks stay visible instead of being hidden behind a dynamic runtime.
The crate makes the storage policy and result promotion visible.
That matters for fp16, bf16, fp8, packed bits, and strided reductions.
## Quickstart

A minimal end-to-end call (the crate path and values here are illustrative; the call shape follows the trait API documented below):

```rust
use numkong::Dot;

let a: Vec<f32> = vec![1.0, 2.0, 3.0];
let b: Vec<f32> = vec![4.0, 5.0, 6.0];
let dot = f32::dot(&a, &b).unwrap();
assert!((dot - 32.0).abs() < 1e-6);
```
## Highlights

This is the most fully featured high-level SDK after Python. It is a good fit if you want most of the native breadth without dropping into a manual FFI layer.

- **Trait-first scalar API.** `Type::operation(&a, &b)` stays compact and predictable.
- **Allocator-aware tensors.** `Tensor`, `PackedMatrix`, and `MaxSimPackedMatrix` can use custom allocators.
- **Storage-first low precision.** `f16`, `bf16`, fp8, fp6, and packed integer wrappers are first-class types.
- **Matrix kernels with explicit contracts.** Packed and symmetric kernels validate shapes and row contiguity.
- **No hidden thread pool.** Parallel helpers remain host-controlled.
- **Fork Union support.** The `parallel` feature is the intended native orchestration layer.
## Ecosystem Comparison
| Feature | NumKong | nalgebra | ndarray |
|---|---|---|---|
| Operation families | dots, distances, binary, probability, geospatial, curved, mesh, sparse, MaxSim, elementwise, reductions, cast, trig | linear algebra, decompositions | general n-dimensional arithmetic |
| Precision | BFloat16 through sub-byte; automatic widening; Kahan summation; 0 ULP in Float32/Float64 | Float32/Float64 only; no widening; standard accuracy | Float32/Float64 only; no widening; standard accuracy |
| Runtime SIMD dispatch | auto-selects best ISA per-thread at runtime across x86, ARM, RISC-V | none | none |
| Packed matrix, GEMM-like | pack once, reuse across query batches | standard matmul; no persistent packing | dot for matmul; no persistent packing |
| Symmetric kernels, SYRK-like | skip duplicate pairs, up to 2x speedup for self-distance | no duplicate-pair skipping | no duplicate-pair skipping |
| Memory model | Caller-owned; Tensor/PackedMatrix support custom allocators | Heap-allocated matrices; custom storage trait | Heap-allocated; no custom allocator support |
| Host-side parallelism | row-range partitioning via reusable ThreadPool; no hidden threads | Rayon-based parallelism possible | Rayon-based parallelism possible |
NumKong validates f16 and bf16 interop against the half crate in its own test suite.
That lets you move between ecosystem-standard half types and NumKong's kernel-facing wrappers without ambiguity.
## Installation

Minimal (the crate name below follows the project name):

```toml
[dependencies]
numkong = "7"
```

With host-side parallel helpers:

```toml
[dependencies]
numkong = { version = "7", features = ["parallel", "std"] }
```
## Compilation and Backend Selection
The crate uses the cc build system to compile the C backend with NK_DYNAMIC_DISPATCH=1 automatically.
All supported backends for the target architecture are compiled into a single binary and selected at runtime.
The two Cargo features are std, which enables standard library support, and parallel, which adds host-side orchestration via ForkUnion and implies std.
Backend selection follows the target architecture:

- ARM gets NEON, SVE, and SME, with SME available on Linux, FreeBSD, and macOS.
- x86-64 gets Haswell (AVX2), the Skylake/Icelake/Sapphire Rapids AVX-512 variants, and AMX on Linux and Windows only.
- RISC-V gets RVV backends on Linux and FreeBSD.
- WASM gets relaxed v128.
Individual backends can be disabled through environment variables. Any `NK_TARGET_*` variable set to `0` or `false` disables that backend; backends not explicitly disabled are enabled by default for the target platform.

```sh
NK_TARGET_NEON=0
NK_TARGET_SVE=0 NK_TARGET_SME=0
```
If a backend fails to compile, the build system automatically disables it and retries with the remaining backends. A warning is emitted for each disabled backend.
## Dynamic Dispatch and Capabilities
configure_thread configures rounding behavior and enables CPU-specific acceleration features such as Intel AMX.
It must be called once per thread before using any SIMD-accelerated operations.
```rust
use numkong::{available, configure_thread, SAPPHIREAMX};

let caps = available();
configure_thread(caps); // argument shape reconstructed from context
if caps & SAPPHIREAMX != 0 {
    // Sapphire Rapids AMX kernels are available on this thread
}
```
Call configure_thread at the start of every thread that will invoke NumKong kernels.
In a thread-pool setting, each worker thread needs its own call.
The function is idempotent and cheap to call more than once on the same thread.
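The per-thread rule maps directly onto `std::thread`. The sketch below stubs `configure_thread` locally so it stands alone; in real code each worker would call the crate's function instead:

```rust
use std::thread;

// Local stub standing in for NumKong's `configure_thread`, so this sketch
// compiles on its own without the crate.
fn configure_thread() { /* per-thread rounding and ISA setup happens here */ }

fn main() {
    let handles: Vec<_> = (0..4)
        .map(|worker: usize| {
            thread::spawn(move || {
                configure_thread(); // once per worker, before any kernels
                worker * worker // stand-in for kernel work
            })
        })
        .collect();
    let total: usize = handles.into_iter().map(|h| h.join().unwrap()).sum();
    println!("{total}");
}
```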
## Core Traits

The crate root re-exports the main metric families:

`Dot`, `VDot`, `Angular`, `Euclidean`, `Hamming`, `Jaccard`, `KullbackLeibler`, `JensenShannon`, `Haversine`, `Vincenty`, `Bilinear`, `Mahalanobis`, `ReduceMoments`, `ReduceMinMax`, `EachScale`, `EachSum`, `EachBlend`, `EachFMA`
The standard call shape is `Type::operation(&a, &b)`; the values below are illustrative:

```rust
use numkong::{Dot, Jaccard, JensenShannon};

let a = vec![1.0_f32, 2.0, 3.0];
let b = vec![4.0_f32, 5.0, 6.0];
let dot = f32::dot(&a, &b).unwrap();

// Packed-bit inputs: each u8 carries 8 boolean dimensions
let bits_a = vec![0b1100_1010_u8, 0b0011_0101];
let bits_b = vec![0b1100_0110_u8, 0b0011_1001];
let jaccard = u8::jaccard(&bits_a, &bits_b).unwrap();

let p = vec![0.25_f32, 0.25, 0.25, 0.25];
let q = vec![0.40_f32, 0.30, 0.20, 0.10];
let jsd = f32::jensenshannon(&p, &q).unwrap();
println!("{dot} {jaccard} {jsd}");
```
## Dot Products

Dot products span real, complex, quantized, and packed storage types.

```rust
use numkong::{Dot, VDot};

// Complex vectors as interleaved (re, im) pairs; layout illustrative
let a = vec![1.0_f32, 2.0, 3.0, 4.0];
let b = vec![5.0_f32, 6.0, 7.0, 8.0];
let dot = f32::dot(&a, &b).unwrap();
let vdot = f32::vdot(&a, &b).unwrap(); // like numpy.vdot, conjugated
println!("{dot} {vdot}");
```
## Dense Distances

The dense spatial family covers sqeuclidean, euclidean, and angular. The main value over naive loops is the combination of SIMD and a safer accumulation policy.

```rust
use numkong::Euclidean;

let a = vec![1_i8, 2, 3];
let b = vec![4_i8, 5, 6];
let distance = i8::euclidean(&a, &b).unwrap(); // widened output, not int8
println!("{distance}");
```
## Scalar Types and Promotions
The scalar wrappers are storage-first types.
They are not decorative aliases over f32.
| Type | Layout | Bytes | Range | Inf | NaN |
|---|---|---|---|---|---|
| f16 | 1+5+10 | 2 | ±65504 | yes | yes |
| bf16 | 1+8+7 | 2 | ±3.4×10³⁸ | yes | yes |
| e4m3 | 1+4+3 | 1 | ±448 | no | yes |
| e5m2 | 1+5+2 | 1 | ±57344 | yes | yes |
| e2m3 | 1+2+3 (6 bit) | 1 | ±7.5 | no | no |
| e3m2 | 1+3+2 (6 bit) | 1 | ±28 | no | no |
| u1x8 | 8 packed bits | 1 | 0–1 per bit | — | — |
| u4x2 | 2×4-bit uint | 1 | 0–15 per nib | — | — |
| i4x2 | 2×4-bit int | 1 | −8–7 per nib | — | — |
The trait hierarchy documents intent:
- `StorageElement` — raw storable element type.
- `NumberLike` — adds numeric conversion and ordering.
- `FloatConvertible` — adds unpacking and float-domain conversion.
The output type is intentionally wider than the storage type for many operations.
For example, i8::dot returns i32.
f32::dot returns a wider accumulator type.
Moments reductions widen even more aggressively.
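The policy is easy to demonstrate without the crate. In plain Rust, accumulating `i8` products in an `i32` shows why the widened return type is necessary:

```rust
// Why i8 dot products return i32: a single product of two i8 values can
// reach 127 * 127 = 16129, far outside i8's ±127 range.
fn main() {
    let a: [i8; 4] = [100; 4];
    let b: [i8; 4] = [100; 4];
    // Widened accumulation, mirroring the crate's promotion policy.
    let dot: i32 = a.iter().zip(&b).map(|(&x, &y)| x as i32 * y as i32).sum();
    assert_eq!(dot, 40_000); // would have overflowed i8 on the first product
    println!("{dot}");
}
```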
## Set Similarity
Packed-binary metrics work on packed words instead of boolean slices. That is the right model once the workload is "semantic hash" rather than "array of booleans".
```rust
use numkong::{Hamming, Jaccard};

let a = vec![0b1100_1010_u8, 0b0011_0101];
let b = vec![0b1100_0110_u8, 0b0011_1001];
let hamming = u8::hamming(&a, &b).unwrap();
let jaccard = u8::jaccard(&a, &b).unwrap();
```
Integer set Jaccard works on sorted arrays of integer identifiers.
```rust
let set_a = vec![1_u32, 3, 5, 7];
let set_b = vec![3_u32, 5, 8];
let jaccard_sets = u32::jaccard(&set_a, &set_b).unwrap();
assert!((jaccard_sets - 0.4).abs() < 1e-6); // |A ∩ B| / |A ∪ B| = 2 / 5
```
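The scalar analogue of the sorted-set kernel is a classic two-pointer merge. This standalone sketch (not the crate's SIMD path) shows the arithmetic behind the Jaccard value above:

```rust
// Two-pointer merge over sorted id arrays, the scalar analogue of the
// SIMD kernel behind integer-set Jaccard.
fn intersection_size(a: &[u32], b: &[u32]) -> usize {
    let (mut i, mut j, mut n) = (0, 0, 0);
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            std::cmp::Ordering::Less => i += 1,
            std::cmp::Ordering::Greater => j += 1,
            std::cmp::Ordering::Equal => {
                n += 1;
                i += 1;
                j += 1;
            }
        }
    }
    n
}

fn main() {
    let (a, b) = ([1u32, 3, 5, 7], [3u32, 5, 8]);
    let inter = intersection_size(&a, &b);
    let union = a.len() + b.len() - inter;
    assert_eq!(inter, 2);
    println!("{}", inter as f64 / union as f64); // Jaccard = 2 / 5
}
```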
## Probability Metrics
```rust
use numkong::{JensenShannon, KullbackLeibler};

let p = vec![0.7_f32, 0.2, 0.1];
let q = vec![0.1_f32, 0.6, 0.3];
let kl_forward = f32::kullbackleibler(&p, &q).unwrap();
let kl_reverse = f32::kullbackleibler(&q, &p).unwrap();
assert!(kl_forward != kl_reverse); // KLD is asymmetric

let js_forward = f32::jensenshannon(&p, &q).unwrap();
let js_reverse = f32::jensenshannon(&q, &p).unwrap();
assert!((js_forward - js_reverse).abs() < 1e-6); // JSD is symmetric
```
## Geospatial Metrics
Inputs are latitudes and longitudes in radians. Outputs are meters.
```rust
use numkong::{Haversine, Vincenty};

// Statue of Liberty (40.6892°N, 74.0445°W) → Big Ben (51.5007°N, 0.1246°W)
let (liberty_lat, liberty_lon) = (40.6892_f64.to_radians(), (-74.0445_f64).to_radians());
let (big_ben_lat, big_ben_lon) = (51.5007_f64.to_radians(), (-0.1246_f64).to_radians());
let mut distance = [0.0_f64];
// Batch signature reconstructed: coordinate slices in, distances out
f64::vincenty(&[liberty_lat], &[liberty_lon], &[big_ben_lat], &[big_ben_lon], &mut distance).unwrap(); // ≈ 5,589,857 m (ellipsoidal, baseline)
f64::haversine(&[liberty_lat], &[liberty_lon], &[big_ben_lat], &[big_ben_lon], &mut distance).unwrap(); // ≈ 5,543,723 m (spherical, ~46 km less)

// Vincenty in f32 — drifts ~2 m from f64
let (liberty_lat32, liberty_lon32) = (liberty_lat as f32, liberty_lon as f32);
let (big_ben_lat32, big_ben_lon32) = (big_ben_lat as f32, big_ben_lon as f32);
let mut distance_f32 = [0.0_f32];
f32::vincenty(&[liberty_lat32], &[liberty_lon32], &[big_ben_lat32], &[big_ben_lon32], &mut distance_f32).unwrap(); // ≈ 5,589,859 m (+2 m drift)
```
## Curved Metrics
Curved-space kernels combine vectors with an extra metric tensor or covariance inverse.
```rust
use numkong::{Bilinear, Mahalanobis};

// Complex bilinear form: aᴴ M b (interleaved re/im pairs; values illustrative)
let a = vec![1.0_f32, 0.0, 2.0, 0.0];
let b = vec![3.0_f32, 0.0, 4.0, 0.0];
let metric = vec![1.0_f32, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]; // complex 2×2 identity
let bilinear = f32::bilinear(&a, &b, &metric).unwrap();

// Real Mahalanobis distance: √((a−b)ᵀ M⁻¹ (a−b))
let x = vec![1.0_f32; 32];
let y = vec![0.0_f32; 32];
let mut inv_cov = vec![0.0_f32; 32 * 32];
for i in 0..32 {
    inv_cov[i * 32 + i] = 1.0; // identity matrix
}
let distance = f32::mahalanobis(&x, &y, &inv_cov).unwrap();
println!("{bilinear} {distance}");
```
## Vectors, Tensors, Views, and Spans
The container model is unusual enough that it needs direct documentation.
- `Vector<T>` owns one-dimensional storage.
- `VectorView<'a, T>` is an immutable borrowed view.
- `VectorSpan<'a, T>` is a mutable borrowed view.
- `Tensor<T, A, MAX_RANK>` owns N-dimensional storage and can use a custom allocator.
- `TensorView` and `TensorSpan` are the borrowed forms.
- `Matrix<T>` is a rank-2 alias over `Tensor<T, _, 2>`.
The allocator story is explicit.
Tensor and PackedMatrix default to Global.
The underlying layout uses SIMD_ALIGNMENT == 64 for owned allocations.
That does not mean callers must align their source buffers manually.
It means owned outputs and packed payloads are allocated in a SIMD-friendly way when the crate owns them.
```rust
use numkong::{RangeStep, SliceRange, Tensor};

let data: Vec<f32> = (0..24).map(|i| i as f32).collect();
let t = Tensor::<f32>::try_from_slice(&data, &[4, 6]).unwrap(); // shape argument reconstructed
let col = t.slice((.., 1_usize)).unwrap();                  // t[:, 1] — column 1
let rows = t.slice((0..2_usize, ..)).unwrap();              // t[0:2, :] — first two rows
let tail = t.slice((-2_isize.., ..)).unwrap();              // t[-2:, :] — last two rows
let neg = t.slice((.., -2..-1_isize)).unwrap();             // t[:, -2:-1]
let step = t.slice((.., RangeStep::new(0, 6, 2))).unwrap(); // t[:, ::2]

// Explicit &[SliceRange] syntax also works (constructor names reconstructed)
let col = t.slice(&[SliceRange::all(), SliceRange::single(1)][..]).unwrap();
```
Tuple elements implement SliceArg — each monomorphized with zero runtime dispatch:
| Rust syntax | Meaning |
|---|---|
| `..` | all |
| `0_usize` / `-1_isize` | single index (negative wraps from end) |
| `1..4_usize` / `-3..-1_isize` | half-open range |
| `..3_usize` / `..-1_isize` | from start |
| `1_usize..` / `-2_isize..` | to end |
| `0..=2_usize` / `-3..=-1_isize` | inclusive range |
| `RangeStep::new(0, 6, 2)` | stepped (no Rust literal) |
Integer literals default to i32 — use _usize / _isize suffixes.
Negative isize values wrap from the dimension end, like Python.
Iteration works at the logical-dimension level.
For sub-byte types like i4x2 (2 nibbles per byte), iterating a 3-element vector yields 6 dimensions.
Immutable iterators (iter()) yield DimRef<T>, which dereferences to T::DimScalar.
Mutable iterators (iter_mut()) yield DimMut<T>, which writes back on drop — the only way to mutate individual nibbles or bits.
```rust
use numkong::{i4x2, Tensor};

// 6 logical nibble dimensions stored in 3 bytes
let mut nibbles = Tensor::<i4x2>::try_zeros(&[6]).unwrap();
for (i, mut dim) in nibbles.iter_mut().enumerate() {
    *dim = i as i8; // DimMut writes the nibble back on drop
}
assert_eq!(nibbles.iter().count(), 6);
assert_eq!(nibbles.iter().map(|d| *d as usize).sum::<usize>(), 15); // 0+1+…+5
```
Vectors and tensors can be converted between each other without copying:
```rust
use numkong::{Tensor, Vector};

let v = Vector::<f32>::try_from_scalars(&[1.0, 2.0, 3.0, 4.0]).unwrap();
let t: Tensor<f32> = v.try_into_tensor(&[2, 2]).unwrap(); // shape argument reconstructed
assert_eq!(t.shape(), &[2, 2]);
let v2 = t.try_into_vector().unwrap();
assert_eq!(v2.len(), 4);
```
The main layout rules are:
- General slicing and transposition are supported by views.
- Elementwise and many reduction kernels accept strided views.
- Matrix-style kernels require rank-2 inputs with contiguous rows.
- A tensor can be non-contiguous overall and still have contiguous rows.
- Some reductions have SIMD kernels for strided lanes.
- Some backends still fall back depending on alignment and dtype.
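The row-contiguity contract reduces to a stride check. A standalone sketch of the rule, with strides expressed in elements (illustrative, not the crate's internal code):

```rust
// Row contiguity as a stride check: the innermost stride must be one
// element, regardless of how the outer dimensions are strided.
fn rows_contiguous(strides: &[isize]) -> bool {
    strides.last() == Some(&1)
}

fn main() {
    assert!(rows_contiguous(&[8, 1]));  // dense row-major 4×8 view
    assert!(rows_contiguous(&[16, 1])); // every other row: rows still contiguous
    assert!(!rows_contiguous(&[8, 2])); // column step of 2 breaks row contiguity
    println!("ok");
}
```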
Sub-byte types (i4x2, u4x2, u1x8) use logical shapes.
A shape of [8] for i4x2 means 8 nibbles (stored in 4 bytes), not 8 bytes.
The innermost dimension must be divisible by dimensions_per_value() (2 for nibble types, 8 for bit types).
Transpose and reshape are not supported for sub-byte types — they return SubByteUnsupported.
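The byte bookkeeping for sub-byte shapes is simple to state in plain Rust (an illustrative helper, not a crate API):

```rust
// Logical shapes for sub-byte types: storage bytes are logical dimensions
// divided by dimensions_per_value (2 for nibble types, 8 for bit types).
fn storage_bytes(logical_dims: usize, dims_per_value: usize) -> usize {
    assert!(
        logical_dims % dims_per_value == 0,
        "innermost dimension must be divisible by dims_per_value"
    );
    logical_dims / dims_per_value
}

fn main() {
    assert_eq!(storage_bytes(8, 2), 4);  // i4x2: 8 nibbles in 4 bytes
    assert_eq!(storage_bytes(64, 8), 8); // u1x8: 64 bits in 8 bytes
    println!("ok");
}
```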
## Elementwise Operations
Elementwise kernels live on tensors and views. They are not a promise that every arbitrary strided view gets the same SIMD path on every backend.
```rust
use numkong::Tensor;

let a = Tensor::<f32>::try_from_slice(&[0.0, 0.5, 1.0, 1.5], &[4]).unwrap();
let b = Tensor::<f32>::try_full(&[4], 2.0).unwrap();
// Blend weight argument reconstructed
let blended = a.view().try_blend_tensor(&b.view(), 0.5).unwrap();
let sines = blended.sin().unwrap();
assert_eq!(sines.shape(), &[4]);
```
Compound assignment operators work in-place:
```rust
use numkong::Tensor;

let mut t = Tensor::<f32>::try_full(&[4], 1.0).unwrap();
t += 10.0;
t -= 0.5;
t *= 2.0;
```
## Trigonometry
The trigonometric kernels share the tensor and view surface. They are useful both directly and as a sanity check that the container path is not just about matrix kernels.
```rust
use numkong::Tensor;

let a = Tensor::<f32>::try_from_slice(&[0.0, 0.5, 1.0, 1.5], &[4]).unwrap();
let c = a.cos().unwrap();
let s = a.sin().unwrap();
assert_eq!(c.shape(), &[4]);
assert_eq!(s.shape(), &[4]);
```
## Moments Reductions
Moments reductions return both sum and sum-of-squares. That is the right building block for norms and variance-like workflows.
```rust
use numkong::{ReduceMoments, Tensor};

let narrow = Tensor::<u8>::try_full(&[1024], 255).unwrap();
let (sum, sum_sq) = narrow.try_moments_all().unwrap(); // widened result types
assert!(sum > u8::MAX as u64); // a naive u8 accumulation would overflow immediately
assert!(sum_sq > u8::MAX as u64); // same for sum-of-squares
```
The important documentation point is not just "wider outputs exist". It is that the API makes the widened outputs part of the type story.
## Min/Max Reductions
Min/max reductions return a MinMaxResult with both the value and its flat index:
```rust
use numkong::Tensor;

let t = Tensor::<f32>::try_from_slice(&[3.0, 9.0, 1.0, 4.0, 7.0, 5.0], &[3, 2]).unwrap();
let second_column = t.slice((.., 1_usize)).unwrap(); // t[:, 1] = [9.0, 4.0, 5.0]
let idx = second_column.try_argmin_all().unwrap();
assert_eq!(idx, 1); // 4.0 sits at flat index 1 within the column
```
## Sparse Operations and Intersections
Sparse helpers cover both sorted-index intersection and weighted sparse dot products.
```rust
use numkong::{SparseDot, SparseIntersect}; // trait names reconstructed

let a_idx = vec![1_u32, 3, 5, 7];
let b_idx = vec![3_u32, 5, 8];
let count = u32::sparse_intersection_size(&a_idx, &b_idx);
assert_eq!(count, 2); // indices 3 and 5

let a_weights = vec![0.1_f32, 0.2, 0.3, 0.4];
let b_weights = vec![0.5_f32, 0.6, 0.7];
let dot = u32::sparse_dot(&a_idx, &a_weights, &b_idx, &b_weights).unwrap();
assert!(dot > 0.0); // weighted dot over the shared indices 3 and 5
```
## Packed Matrix Kernels for GEMM-Like Workloads
Packed kernels are the main "matrix throughput" path in the crate.
They are GEMM-like in workload shape.
They are not a thin BLAS clone.
```rust
use numkong::{PackedMatrix, Tensor};

let a = Tensor::<f32>::try_full(&[8, 64], 1.0).unwrap();  // 8 queries × 64 dims
let b = Tensor::<f32>::try_full(&[16, 64], 1.0).unwrap(); // 16 targets × 64 dims
let b_packed = PackedMatrix::try_pack(&b.view()).unwrap();
let c = a.dots_packed(&b_packed);
assert_eq!(c.shape(), &[8, 16]);
```
The useful economics are:

- pack `B` once, then reuse it across many `A` batches
- convert or pad once during packing instead of on every multiply
- reuse precomputed norms for `angulars_packed` and `euclideans_packed`
The crate checks row contiguity because these kernels assume contiguous rows. Caller-side source alignment is not required. The owned packed buffer handles its own aligned allocation internally.
## Symmetric Kernels for SYRK-Like Workloads
Symmetric kernels are for self-similarity and self-distance.
They are SYRK-like in shape.
They avoid duplicate (i, j) and (j, i) work.
```rust
use numkong::Tensor;

let vectors = Tensor::<f32>::try_full(&[16, 64], 1.0).unwrap();
let gram = vectors.view().try_dots_symmetric().unwrap();
assert_eq!(gram.shape(), &[16, 16]);
```
This family is also where row-window partitioning becomes the natural parallel model.
That is structurally different from packed GEMM-style work against a shared packed RHS.
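Row-window partitioning itself is plain index arithmetic. A standalone sketch of near-equal windows (not the ForkUnion implementation):

```rust
// Near-equal row windows over n rows and t workers: the host-side model
// the symmetric kernels parallelize over.
fn row_windows(n: usize, t: usize) -> Vec<(usize, usize)> {
    (0..t).map(|k| (k * n / t, (k + 1) * n / t)).collect()
}

fn main() {
    let windows = row_windows(10, 3);
    assert_eq!(windows, vec![(0, 3), (3, 6), (6, 10)]);
    // Windows tile the full range with no gaps or overlap.
    assert_eq!(windows.iter().map(|w| w.1 - w.0).sum::<usize>(), 10);
    println!("ok");
}
```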
## MaxSim and ColBERT-Style Late Interaction
MaxSim is the late-interaction primitive used by systems such as ColBERT. It is not "just another matrix multiply".
```rust
use numkong::Tensor;

let queries = Tensor::<f32>::try_full(&[8, 128], 0.1).unwrap(); // 8 query tokens
let docs = Tensor::<f32>::try_full(&[32, 128], 0.1).unwrap();   // 32 document tokens
let queries_packed = queries.view().try_maxsim_pack().unwrap();
let docs_packed = docs.view().try_maxsim_pack().unwrap();
let score = queries_packed.score(&docs_packed);
assert!(score > 0.0);
```
## Geometric Mesh Alignment
Mesh alignment returns transforms, scales, and RMSD values. That is a different API shape from the scalar metric families.
```rust
use numkong::MeshAlignment;

// Two 4-point meshes as flat (x, y, z) triplets; values illustrative,
// result field names reconstructed
let source = vec![0.0_f32, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0];
let target = source.clone();
let result = f32::kabsch(&source, &target).unwrap();
assert!(result.rmsd >= 0.0);
assert!(result.rmsd < 1e-5); // identical meshes align with zero error

// Umeyama with known 2x scaling
let scaled: Vec<f32> = source.iter().map(|x| x * 2.0).collect();
let result = f32::umeyama(&source, &scaled).unwrap();
assert!((result.scale - 2.0).abs() < 1e-4); // recovered scale factor
assert!(result.rmsd < 1e-5);
```
## Tolerance Comparison
Exact floating-point equality is rarely what you want after arithmetic.
allclose() checks every element pair with the formula:
$$ |a - b| \leq \text{atol} + \text{rtol} \cdot |b| $$
Available on Vector, VectorView, VectorSpan, Tensor, TensorView, and TensorSpan.
For tensors, allclose is provided by the AllCloseOps trait — import it if calling on a TensorRef implementor.
Shape mismatch returns false.
The scalar helper is_close is re-exported at crate root.
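The element test can be written out in a few lines of plain Rust (a standalone sketch of the formula above, not the crate's `is_close`):

```rust
// The tolerance test behind allclose: |a - b| <= atol + rtol * |b|.
fn is_close(a: f64, b: f64, rtol: f64, atol: f64) -> bool {
    (a - b).abs() <= atol + rtol * b.abs()
}

fn main() {
    assert!(is_close(1.0, 1.0 + 1e-9, 1e-5, 1e-8)); // within tolerance
    assert!(!is_close(1.0, 1.1, 1e-5, 1e-8));       // far outside
    println!("ok");
}
```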
```rust
use numkong::{is_close, AllCloseOps, Tensor, Vector};

// Scalar check (rtol/atol argument order reconstructed)
assert!(is_close(1.0_f32, 1.0 + 1e-7, 1e-5, 1e-8));

// Vector tolerance check
let a = Vector::<f32>::try_full(1024, 1.0).unwrap();
let b = Vector::<f32>::try_full(1024, 1.0 + 1e-7).unwrap();
assert!(a.allclose(&b, 1e-5, 1e-8));

// Tensor tolerance check
let ta = Tensor::<f32>::try_full(&[8, 8], 1.0).unwrap();
let tb = Tensor::<f32>::try_full(&[8, 8], 1.0 + 1e-7).unwrap();
assert!(ta.allclose(&tb, 1e-5, 1e-8));
```
## Type Casting
The cast function performs bulk conversion between contiguous slices.
Any pair of types that implement CastDtype (all NumberLike scalars) can be converted.
```rust
use numkong::{bf16, cast};

let src: Vec<f32> = vec![1.0, 2.5, -3.75, 100.0];
let mut dst: Vec<bf16> = vec![bf16::default(); src.len()];
cast(&src, &mut dst).unwrap();
assert!(dst.len() == src.len());
```
Tensor, TensorView, and TensorSpan expose casting via the CastOps trait.
try_cast_dtype() allocates a new tensor; try_cast_dtype_into() writes into a pre-allocated TensorSpan.
Strided and non-contiguous views are supported: the implementation scans strides from the innermost dimension outward to find the longest contiguous tail, then walks the outer dimensions and casts each contiguous block in a single kernel call.
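The contiguous-tail scan can be sketched in plain Rust, with strides in elements (illustrative, not the crate's internal code):

```rust
// Innermost-first scan for the longest contiguous tail: grow the run while
// each dimension's stride equals the element count of the tail so far.
fn contiguous_tail(shape: &[usize], strides: &[isize]) -> usize {
    let mut run = 1usize;
    for (&n, &s) in shape.iter().rev().zip(strides.iter().rev()) {
        if s == run as isize {
            run *= n;
        } else {
            break;
        }
    }
    run
}

fn main() {
    // 2×3×4 row-major: fully contiguous, casts as one 24-element block.
    assert_eq!(contiguous_tail(&[2, 3, 4], &[12, 4, 1]), 24);
    // Strided outermost dimension: casts proceed in 12-element blocks.
    assert_eq!(contiguous_tail(&[2, 3, 4], &[24, 4, 1]), 12);
    println!("ok");
}
```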
```rust
use numkong::{bf16, CastOps, Tensor};

let src = Tensor::<f32>::try_full(&[4, 8], 1.5).unwrap();
let mut dst = Tensor::<bf16>::try_zeros(&[4, 8]).unwrap();
src.view().try_cast_dtype_into(&mut dst.span()).unwrap(); // span accessor name reconstructed
```
## Parallelism and ForkUnion
NumKong does not own a thread pool.
The parallel feature adds host-side orchestration helpers via ForkUnion, not a hidden scheduler.
```rust
use numkong::{PackedMatrix, Tensor};
use fork_union::ThreadPool; // module path reconstructed

let a = Tensor::<f32>::try_full(&[256, 64], 1.0).unwrap();
let b = Tensor::<f32>::try_full(&[512, 64], 1.0).unwrap();
let mut pool = ThreadPool::try_spawn(8).unwrap();

// GEMM-like: rows of A partitioned across threads, one shared packed B
let b_packed = PackedMatrix::try_pack(&b.view()).unwrap();
let c = a.dots_packed_parallel(&b_packed, &mut pool);
assert_eq!(c.shape(), &[256, 512]);

// SYRK-like: row windows of one square output partitioned across threads
let gram = a.dots_symmetric_parallel(&mut pool);
assert_eq!(gram.shape(), &[256, 256]);
```
Rayon or a manual thread pool can still work if the rest of your application already depends on them.
## Addressing External Memory
Views wrap raw pointers without ownership, owned containers accept custom allocators, and the scalar trait API works on any &[T] regardless of how the memory was allocated.
VectorView::from_raw_parts and TensorView::from_raw_parts wrap device-accessible or externally allocated memory.
The mutable counterparts VectorSpan::from_raw_parts and TensorSpan::from_raw_parts work the same way with *mut T.
```rust
use numkong::{TensorView, VectorView};

let embeddings_ptr: *const f32 = /* from CUDA, mmap, or FFI */;
let embeddings = unsafe { VectorView::from_raw_parts(embeddings_ptr, 1024) };

let shape = [256_usize, 1024];
let strides = [1024_isize, 1]; // row-major f32; stride units reconstructed as elements
let matrix = unsafe { TensorView::from_raw_parts(embeddings_ptr, &shape, &strides) };
```
Owned containers accept any allocator. A CUDA unified memory allocator looks like this:
use ;
use NonNull;
use Vector;
;
unsafe
let queries = try_zeros_in.unwrap;
The trait-based scalar API works on any &[T] — Vec, mmap, arena, or pinned buffer:
```rust
use numkong::Dot;

let weights: &[f32] = /* any contiguous slice */;
let query: &[f32] = /* any contiguous slice */;
let similarity = f32::dot(weights, query).unwrap();
```