VaeaNTT is a Rust library providing NTT (Number Theoretic Transform) implementations optimized for ARM NEON (aarch64), with a portable scalar fallback for all platforms.
- ARM NEON native: all butterfly stages vectorized with 4-wide u32 SIMD
- Two pipelines: 28-bit primes (ntt32) and 60–62 bit primes (ntt64)
- no_std: runs on bare metal, requires only alloc
- Constant-time: branchless arithmetic, no data-dependent branches
- Runtime-generic: any NTT-friendly prime, not hardcoded to one scheme
- Multi-language: C, C++, JS/WASM bindings via Diplomat FFI
Quick Start
Add to your Cargo.toml:
[dependencies]
= "0.1"
Basic NTT
use Ntt32Context;
// Any NTT-friendly prime < 2^28
let ctx = new; // ML-DSA prime
let mut data = vec!;
ctx.forward; // Coefficient → NTT domain
ctx.inverse; // NTT domain → Coefficient
assert!;
Post-Quantum Preset (ML-DSA)
use ;
let ntt = new; // NIST Level 3
let mut poly = vec!;
poly = 1;
ntt.forward;
ntt.inverse;
assert_eq!;
Polynomial Multiplication
use Ntt32Context;
let ctx = new;
// (1 + x) × (1 + x) = 1 + 2x + x² in Z_q[X]/(X^256 + 1)
let mut a = vec!;
a = 1; a = 1;
let result = ctx.negacyclic_mul;
assert_eq!;
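To make the expected result concrete, here is a minimal schoolbook reference for multiplication in Z_q[X]/(X^N + 1). It is only a sketch for cross-checking: the function name `negacyclic_mul_schoolbook` is hypothetical and not part of VaeaNTT, and the O(N²) loop is not how an NTT-based product is computed.

```rust
/// Schoolbook product in Z_q[X]/(X^N + 1), O(N^2). Reference only.
/// Assumes both inputs have the same length N and coefficients in [0, q).
fn negacyclic_mul_schoolbook(a: &[u32], b: &[u32], q: u32) -> Vec<u32> {
    let n = a.len();
    assert_eq!(n, b.len(), "operands must have the same length");
    let q64 = q as u64;
    let mut out = vec![0u64; n];
    for i in 0..n {
        for j in 0..n {
            let prod = (a[i] as u64 * b[j] as u64) % q64;
            let k = i + j;
            if k < n {
                out[k] = (out[k] + prod) % q64;
            } else {
                // X^N = -1, so a term that wraps past degree N-1 is subtracted.
                out[k - n] = (out[k - n] + q64 - prod) % q64;
            }
        }
    }
    out.into_iter().map(|c| c as u32).collect()
}
```

With a = 1 + x and N = 256, the first three output coefficients are 1, 2, 1 and the rest are zero. Multiplying x by x^255 instead yields q − 1 in position 0, which is the negacyclic wrap-around.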
Supported Parameters
VaeaNTT accepts any prime q and power-of-two N satisfying q ≡ 1 (mod 2N).
ntt32: Primes < 2²⁸
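The congruence condition can be checked with a few lines of integer arithmetic. The helpers below are an illustrative sketch with hypothetical names (`satisfies_ntt_congruence`, `max_ntt_size`); they are not VaeaNTT functions, and they do not test whether q is prime.

```rust
/// True when N is a power of two and q ≡ 1 (mod 2N).
/// Primality of q is NOT checked here.
fn satisfies_ntt_congruence(q: u64, n: u64) -> bool {
    n.is_power_of_two() && q % (2 * n) == 1
}

/// Largest power-of-two N with q ≡ 1 (mod 2N), or 0 if none exists.
fn max_ntt_size(q: u64) -> u64 {
    assert!(q >= 2, "modulus must be at least 2");
    // Write q - 1 = 2^k * odd; the largest admissible N satisfies 2N = 2^k.
    let k = (q - 1).trailing_zeros();
    if k == 0 { 0 } else { 1u64 << (k - 1) }
}
```

For example, 8 380 417 − 1 = 2¹³ · 1023, so the ML-DSA prime admits N up to 4096, while 3329 − 1 = 2⁸ · 13 caps the ML-KEM prime at N = 128.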
| Use Case | q | Bits | Tested N |
|---|---|---|---|
| ML-DSA | 8 380 417 | 23 | 256 |
| Falcon | 12 289 | 14 | 512, 1024 |
| NewHope | 7 681 | 13 | 512, 1024 |
| FHE (CKKS/BGV CRT limbs) | any < 2²⁸ | ≤ 28 | up to 32 768 |
ntt64: Primes 60–62 bits
For FHE-compatible 64-bit primes. Includes built-in constants for common primes
(PRIME_SEAL, PRIME_60_1, PRIME_62_1, etc.).
Note on ML-KEM: ML-KEM uses q = 3329 with an incomplete NTT (size-128 over coefficient pairs), not a standard negacyclic NTT. VaeaNTT's standard NTT works with q = 3329 for N ≤ 128. A dedicated incomplete NTT module for ML-KEM is planned.
API Reference
Modules
| Module | Description |
|---|---|
| ntt32 | NTT for primes < 2²⁸. ARM NEON vectorized + scalar fallback. |
| ntt64 | NTT for 60–62 bit primes. Barrett and Montgomery arithmetic. |
| pq | Post-quantum presets for ML-DSA. |
| poly | Polynomial arithmetic over Z_q[X]/(X^N + 1), 64-bit coefficients. |
| rns | Residue Number System (multi-prime CRT) for FHE. |
| ffi | FFI bindings via Diplomat (C, C++, JS/WASM). Requires ffi feature. |
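As background for the rns row, the sketch below shows the idea of a residue number system with two coprime primes: arithmetic happens independently per residue, and the full value is recovered with the Chinese Remainder Theorem. The function names are hypothetical and say nothing about the rns module's actual API. The sketch assumes q2 is prime (the inverse is computed with Fermat's little theorem) and that q1 · q2 fits in u64.

```rust
/// Square-and-multiply modular exponentiation using u128 intermediates.
fn pow_mod(mut base: u64, mut exp: u64, m: u64) -> u64 {
    let mut acc = 1 % m;
    base %= m;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = (acc as u128 * base as u128 % m as u128) as u64;
        }
        base = (base as u128 * base as u128 % m as u128) as u64;
        exp >>= 1;
    }
    acc
}

/// Split x into its residues modulo q1 and q2.
fn to_rns(x: u64, q1: u64, q2: u64) -> (u64, u64) {
    (x % q1, x % q2)
}

/// Garner-style CRT reconstruction; returns the unique x in [0, q1 * q2).
/// Assumes q2 is prime, gcd(q1, q2) = 1, and q1 * q2 < 2^64.
fn from_rns(r1: u64, r2: u64, q1: u64, q2: u64) -> u64 {
    let q1_inv = pow_mod(q1, q2 - 2, q2); // q1^{-1} mod q2 by Fermat
    let diff = (r2 + q2 - r1 % q2) % q2; // (r2 - r1) mod q2
    let t = (diff as u128 * q1_inv as u128 % q2 as u128) as u64;
    r1 + q1 * t
}
```

Multiplying two values then amounts to multiplying their residues pairwise and calling `from_rns` once at the end, which gives the product modulo q1 · q2.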
Ntt32Context
// Construction
let ctx = new; // panics on invalid params
let ctx = try_new?; // returns Result<_, NttError>
// Forward / Inverse NTT (in-place)
ctx.forward; // coefficient → NTT domain
ctx.inverse; // NTT → coefficient (× N⁻¹)
ctx.inverse_lazy; // NTT → coefficient (no N⁻¹)
// Polynomial multiplication in Z_q[X]/(X^N + 1)
let result = ctx.negacyclic_mul; // allocating
ctx.negacyclic_mul_into; // zero-allocation
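The difference between the scaled and the lazy inverse can be illustrated with a naive O(N²) negacyclic transform. This is a sketch under stated assumptions, not VaeaNTT's implementation: the function names are hypothetical, outputs are in natural order (the library's internal ordering may differ), psi must be a primitive 2N-th root of unity mod q, and q is assumed to be a prime below 2³² so products fit in u64.

```rust
/// Square-and-multiply modular exponentiation; assumes q < 2^32.
fn pow_mod(mut base: u64, mut exp: u64, q: u64) -> u64 {
    let mut acc = 1 % q;
    base %= q;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % q;
        }
        base = base * base % q;
        exp >>= 1;
    }
    acc
}

/// Forward transform: A[k] = sum_j a[j] * psi^((2k+1) j) mod q.
fn forward_naive(a: &[u64], psi: u64, q: u64) -> Vec<u64> {
    let n = a.len() as u64;
    (0..n)
        .map(|k| {
            (0..n).fold(0u64, |acc, j| {
                let e = (2 * k + 1) * j % (2 * n);
                (acc + a[j as usize] * pow_mod(psi, e, q)) % q
            })
        })
        .collect()
}

/// Inverse without the final N^{-1} scaling: returns N * a[j] mod q.
fn inverse_lazy_naive(a_hat: &[u64], psi: u64, q: u64) -> Vec<u64> {
    let n = a_hat.len() as u64;
    (0..n)
        .map(|j| {
            (0..n).fold(0u64, |acc, k| {
                // psi^{-e} equals psi^{2N - e} because psi has order 2N.
                let e = (2 * n - (2 * k + 1) * j % (2 * n)) % (2 * n);
                (acc + a_hat[k as usize] * pow_mod(psi, e, q)) % q
            })
        })
        .collect()
}

/// Full inverse: the lazy inverse followed by multiplication with N^{-1} mod q.
fn inverse_naive(a_hat: &[u64], psi: u64, q: u64) -> Vec<u64> {
    let n_inv = pow_mod(a_hat.len() as u64, q - 2, q);
    inverse_lazy_naive(a_hat, psi, q)
        .into_iter()
        .map(|c| c * n_inv % q)
        .collect()
}
```

With q = 17, N = 4, and psi = 2 (2⁴ ≡ −1 mod 17), transforming [1, 2, 3, 4] and applying the full inverse returns the input, while the lazy inverse returns [4, 8, 12, 16], every coefficient multiplied by N. Skipping the scaling is useful when a later step can absorb the factor N⁻¹.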
On aarch64, forward/inverse dispatch to NEON automatically.
On other architectures, a scalar fallback using Shoup multiplication and Harvey lazy butterflies is used.
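For readers unfamiliar with the two techniques named above, the following is a minimal scalar sketch of Shoup multiplication and a Harvey-style lazy butterfly. It is written from the published descriptions of these methods under the assumption q < 2²⁸. The function names are hypothetical and the code is not taken from VaeaNTT's scalar.rs.

```rust
/// Precompute w' = floor(w * 2^32 / q) for a fixed twiddle factor w < q.
fn shoup_precompute(w: u32, q: u32) -> u32 {
    (((w as u64) << 32) / q as u64) as u32
}

/// Lazy Shoup product: congruent to w * x mod q, in the range [0, 2q).
fn shoup_mul_lazy(x: u32, w: u32, w_shoup: u32, q: u32) -> u32 {
    // Estimated quotient floor(w * x / q), off by at most one.
    let hi = ((x as u64 * w_shoup as u64) >> 32) as u32;
    // The true remainder is below 2q < 2^32, so wrapping arithmetic is exact.
    x.wrapping_mul(w).wrapping_sub(hi.wrapping_mul(q))
}

/// Conditional subtraction without an `if`: maps [0, 2m) to [0, m).
/// Requires m < 2^31.
fn cond_sub(x: u32, m: u32) -> u32 {
    let t = x.wrapping_sub(m);
    let mask = (t >> 31).wrapping_neg(); // all ones when x < m
    t.wrapping_add(m & mask)
}

/// Harvey-style lazy butterfly. Inputs and outputs stay in [0, 4q).
/// Returns values congruent to (x + w*y, x - w*y) mod q.
fn harvey_butterfly(x: u32, y: u32, w: u32, w_shoup: u32, q: u32) -> (u32, u32) {
    let two_q = 2 * q;
    let x = cond_sub(x, two_q); // now in [0, 2q)
    let t = shoup_mul_lazy(y, w, w_shoup, q); // in [0, 2q)
    (x + t, x + two_q - t) // both in [0, 4q)
}
```

The only division happens once per twiddle factor in `shoup_precompute`. Each butterfly then costs two multiplications and a few additions, and full reduction to [0, q) is postponed to the end of the transform.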
PqNtt
use ;
let ntt = new;
ntt.forward;
ntt.inverse;
let product = ntt.multiply;
// Available presets:
// PqScheme::MlDsa44: NIST Level 2 (q=8380417, N=256)
// PqScheme::MlDsa65: NIST Level 3 (q=8380417, N=256)
// PqScheme::MlDsa87: NIST Level 5 (q=8380417, N=256)
Utilities
use ;
// Generate NTT-friendly primes < 2^28 for a given N
let primes = generate_primes_28; // 3 primes for N=1024
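One simple way to find such primes is to walk the candidates of the form k · 2N + 1 below 2²⁸ and keep those that pass a primality test. The sketch below does this with trial division. It is an illustration with hypothetical names (`is_prime`, `find_ntt_primes_28`), and the search order (largest first) is an assumption that may not match what generate_primes_28 returns.

```rust
/// Trial-division primality test; adequate for n < 2^28.
fn is_prime(n: u32) -> bool {
    if n < 2 {
        return false;
    }
    if n % 2 == 0 {
        return n == 2;
    }
    let mut d = 3u32;
    while (d as u64) * (d as u64) <= n as u64 {
        if n % d == 0 {
            return false;
        }
        d += 2;
    }
    true
}

/// Collect up to `count` primes q < 2^28 with q ≡ 1 (mod 2N), largest first.
/// Assumes N is a power of two with 2N < 2^28.
fn find_ntt_primes_28(n: u32, count: usize) -> Vec<u32> {
    assert!(n.is_power_of_two(), "N must be a power of two");
    let step = 2 * n;
    let limit = 1u32 << 28;
    // Largest candidate of the form k * step + 1 that is below the limit.
    let mut q = (limit - 2) / step * step + 1;
    let mut out = Vec::new();
    while out.len() < count && q > step {
        if is_prime(q) {
            out.push(q);
        }
        q -= step;
    }
    out
}
```

Calling `find_ntt_primes_28(1024, 3)` returns three primes that are each congruent to 1 modulo 2048, suitable as CRT limbs for N = 1024.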
Features
| Feature | Default | Description |
|---|---|---|
| std | yes | Enables std::error::Error impl on NttError |
| rand | | Random polynomial generation (Poly64::new_random(), etc.) |
| ffi | no | Diplomat FFI bindings (C, C++, JS/WASM) |
no_std Usage
[dependencies]
= { version = "0.1", default-features = false }
Requires alloc. Zero runtime dependencies in this configuration.
Performance
Measured with Criterion on Apple M3 Pro (aarch64), --release, single-threaded.
Forward NTT (ntt32, q = 12 289)
| N | Latency | Throughput |
|---|---|---|
| 64 | 66 ns | 970 M coeff/s |
| 256 | 234 ns | 1.09 G coeff/s |
| 1 024 | 1.19 µs | 860 M coeff/s |
| 4 096 | 5.7 µs | 719 M coeff/s |
| 8 192 | 11.4 µs | 719 M coeff/s |
| 16 384 | 27.2 µs | 602 M coeff/s |
| 32 768 | 58.5 µs | 560 M coeff/s |
Inverse NTT (ntt32, q = 12 289)
| N | Latency |
|---|---|
| 256 | 320 ns |
| 1 024 | 1.55 µs |
| 4 096 | 7.7 µs |
| 32 768 | 63.8 µs |
Negacyclic Polynomial Multiplication
Two forward NTTs + pointwise multiply + inverse NTT.
| N | Total |
|---|---|
| 256 | 1.08 µs |
| 1 024 | 4.97 µs |
| 4 096 | 23.3 µs |
Run cargo bench on your hardware for your own numbers. Results vary with hardware and system load. Disable CPU frequency scaling for reproducible measurements.
Architecture
src/
├── ntt32/            # 28-bit NTT pipeline
│   ├── arith.rs      # Branchless modular arithmetic (add, sub, mul, pow, inv)
│   ├── context.rs    # Ntt32Context: unified API with NEON/scalar dispatch
│   ├── neon.rs       # ARM NEON intrinsics (4-stage fused butterflies)
│   ├── scalar.rs     # Portable scalar (Shoup multiplication, Harvey butterfly)
│   └── prime.rs      # NTT-friendly prime generation, primitive root finding
├── ntt64/            # 64-bit NTT pipeline (Barrett + Montgomery)
│   ├── arith.rs      # 64-bit modular arithmetic
│   ├── context.rs    # Ntt64Context
│   └── prime.rs      # 64-bit prime utilities
├── pq.rs             # Post-quantum presets (ML-DSA)
├── poly.rs           # Poly64: polynomial over Z_q[X]/(X^N+1)
├── rns.rs            # RNS/CRT multi-prime decomposition
├── ffi.rs            # Diplomat FFI bridge
└── lib.rs
Design Rationale
- ARM NEON native: 4× u32 lanes. u32 × u32 products fit in u64, no widening to u128.
- Lazy reduction: With q < 2²⁸, intermediates 3q < 2³⁰ fit in u32, enabling deferred Barrett reduction across multiple butterfly stages.
- PQ aligned: All NIST lattice standards use primes ≤ 23 bits, well within 28 bits.
- FHE schemes (CKKS, BGV) use 60–62 bit primes, which don't fit in u32.
- ntt64 provides Barrett and Montgomery arithmetic for large primes.
- RNS combines multiple ntt64 contexts for multi-precision FHE computation.
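To show why the 64-bit pipeline needs different arithmetic, here is a minimal Montgomery multiplication sketch for an odd modulus below 2⁶³. Unlike the 28-bit case, the product of two residues needs a 128-bit intermediate. The function names are hypothetical, and the code follows the textbook algorithm with R = 2⁶⁴. It is not taken from ntt64/arith.rs.

```rust
/// Compute -q^{-1} mod 2^64 for odd q by Newton iteration.
fn neg_inv_mod_2_64(q: u64) -> u64 {
    // For odd q, q * q ≡ 1 (mod 8), so q is its own inverse to 3 bits.
    let mut inv = q;
    for _ in 0..5 {
        // Each step doubles the number of correct low bits: 3, 6, 12, 24, 48, 96.
        inv = inv.wrapping_mul(2u64.wrapping_sub(q.wrapping_mul(inv)));
    }
    inv.wrapping_neg()
}

/// Montgomery product a * b * R^{-1} mod q with R = 2^64.
/// Assumes a, b < q, q odd, and q < 2^63.
fn mont_mul(a: u64, b: u64, q: u64, q_neg_inv: u64) -> u64 {
    let t = a as u128 * b as u128;
    let m = (t as u64).wrapping_mul(q_neg_inv);
    // t + m * q is divisible by 2^64; the quotient lies in [0, 2q).
    let u = ((t + m as u128 * q as u128) >> 64) as u64;
    let mask = ((u >= q) as u64).wrapping_neg();
    u - (q & mask)
}

/// Convert into Montgomery form: a * R mod q.
fn to_mont(a: u64, q: u64) -> u64 {
    (((a as u128) << 64) % q as u128) as u64
}

/// Convert out of Montgomery form by multiplying with 1.
fn from_mont(a: u64, q: u64, q_neg_inv: u64) -> u64 {
    mont_mul(a, 1, q, q_neg_inv)
}
```

A product is computed as `from_mont(mont_mul(to_mont(a, q), to_mont(b, q), q, inv), q, inv)`. In an NTT the twiddle factors would be stored in Montgomery form once, so the conversions do not appear in the inner loop.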
Security
| Property | Guarantee |
|---|---|
| Constant-time | All arithmetic uses branchless SIMD masks (vcgeq + vandq), no data-dependent branches. |
| Input validation | try_new() rejects non-prime q, non-power-of-two N, and non-NTT-friendly primes. |
| Memory safety | All NEON accesses are bounds-checked via loop guards. unsafe limited to NEON intrinsics. |
| Thread safety | Ntt32Context is Send + Sync. Verified with 8 threads × 100 iterations. |
See SECURITY.md for the vulnerability disclosure policy.
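The mask pattern named in the constant-time row can be mimicked in portable Rust. The sketch below is a scalar, lane-by-lane analogue of a compare-then-AND sequence for modular addition. It is an illustration with a hypothetical function name, not the crate's NEON code, and the Rust compiler does not guarantee that the comparison compiles to branch-free machine code.

```rust
/// Lane-wise (a + b) mod q on four u32 values.
/// Assumes all inputs are in [0, q) and q < 2^31.
fn add_mod_x4(a: [u32; 4], b: [u32; 4], q: u32) -> [u32; 4] {
    let mut out = [0u32; 4];
    for i in 0..4 {
        let s = a[i] + b[i]; // in [0, 2q), no overflow since q < 2^31
        // All-ones mask when s >= q, zero otherwise (the role of a compare).
        let mask = ((s >= q) as u32).wrapping_neg();
        // Subtract q only in the masked lanes (the role of an AND with q).
        out[i] = s - (q & mask);
    }
    out
}
```

Every lane executes the same instructions regardless of its value, which is the property the table describes: the correction is selected by data, not by control flow.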
Testing
# Unit + integration + doc tests
# Benchmarks
# Security & exhaustive validation
License
This project is dual-licensed:
Open Source: AGPL-3.0-or-later
Free for open-source projects. See LICENSE.
If you use VaeaNTT in a network service or distribute it, you must release your complete source code under the AGPL. This applies to modified and unmodified usage.
Commercial License
For closed-source, proprietary, or embedded use, a commercial license is available that removes all AGPL obligations.
Contact: alexis@vaea.tech