vaea-ntt 0.1.1

VaeaNTT is a Rust library providing NTT (Number Theoretic Transform) implementations optimized for ARM NEON (aarch64), with a portable scalar fallback for all platforms.

🚀 ARM NEON native — all butterfly stages vectorized with 4-wide u32 SIMD
🔀 Two pipelines — 28-bit primes (ntt32) and 60–62 bit primes (ntt64)
📦 no_std — runs on bare-metal, requires only alloc
🔒 Constant-time — branchless arithmetic, no data-dependent branches
🎯 Runtime-generic — any NTT-friendly prime, not hardcoded to one scheme
🌐 Multi-language — C, C++, JS/WASM bindings via Diplomat FFI

Quick Start
Supported Parameters
API Reference
Performance
Architecture
Security
Testing
License

Quick Start

Add to your Cargo.toml:

[dependencies]
vaea-ntt = "0.1"

Basic NTT

use vaea_ntt::ntt32::Ntt32Context;

// Any NTT-friendly prime < 2^28
let ctx = Ntt32Context::new(256, 8_380_417); // ML-DSA prime

let mut data = vec![42u32; 256];
ctx.forward(&mut data);   // Coefficient → NTT domain
ctx.inverse(&mut data);   // NTT domain → Coefficient
assert!(data.iter().all(|&x| x == 42));

Post-Quantum Preset (ML-DSA)

use vaea_ntt::pq::{PqScheme, PqNtt};

let ntt = PqNtt::new(PqScheme::MlDsa65); // NIST Level 3
let mut poly = vec![0u32; 256];
poly[0] = 1;
ntt.forward(&mut poly);
ntt.inverse(&mut poly);
assert_eq!(poly[0], 1);

Polynomial Multiplication

use vaea_ntt::ntt32::Ntt32Context;

let ctx = Ntt32Context::new(256, 8_380_417);

// (1 + x) × (1 + x) = 1 + 2x + x²  in Z_q[X]/(X^256 + 1)
let mut a = vec![0u32; 256];
a[0] = 1; a[1] = 1;
let result = ctx.negacyclic_mul(&a, &a);
assert_eq!(&result[..3], &[1, 2, 1]);

Supported Parameters

VaeaNTT accepts any prime q and power-of-two N satisfying q ≡ 1 (mod 2N).

`ntt32` — Primes < 2²⁸

Use Case	q	Bits	Tested N
ML-DSA	8 380 417	23	256
Falcon	12 289	14	512, 1024
NewHope	7 681	13	512, 1024
FHE (CKKS/BGV CRT limbs)	any < 2²⁸	≤ 28	up to 32 768

`ntt64` — Primes 60–62 bits

For FHE-compatible 64-bit primes. Includes built-in constants for common primes (PRIME_SEAL, PRIME_60_1, PRIME_62_1, etc.).

Note on ML-KEM: ML-KEM uses q = 3329 with an incomplete NTT (size-128 over coefficient pairs), not a standard negacyclic NTT. VaeaNTT's standard NTT works with q = 3329 for N ≤ 128. A dedicated incomplete NTT module for ML-KEM is planned.

API Reference

Modules

Module	Description
`ntt32`	NTT for primes < 2²⁸. ARM NEON vectorized + scalar fallback.
`ntt64`	NTT for 60–62 bit primes. Barrett and Montgomery arithmetic.
`pq`	Post-quantum presets for ML-DSA.
`poly`	Polynomial arithmetic over Z_q[X]/(X^N + 1), 64-bit coefficients.
`rns`	Residue Number System (multi-prime CRT) for FHE.
`ffi`	FFI bindings via Diplomat (C, C++, JS/WASM). Requires `ffi` feature.

`Ntt32Context`

// Construction
let ctx = Ntt32Context::new(n, q);           // panics on invalid params
let ctx = Ntt32Context::try_new(n, q)?;      // returns Result<_, NttError>

// Forward / Inverse NTT (in-place)
ctx.forward(&mut data);                       // coefficient → NTT domain
ctx.inverse(&mut data);                       // NTT → coefficient (× N⁻¹)
ctx.inverse_lazy(&mut data);                  // NTT → coefficient (no N⁻¹)

// Polynomial multiplication in Z_q[X]/(X^N + 1)
let result = ctx.negacyclic_mul(&a, &b);     // allocating
ctx.negacyclic_mul_into(&mut a, &mut b, &mut result); // zero-allocation

On aarch64, forward/inverse dispatch to NEON automatically. On other architectures, a scalar fallback using Shoup multiplication and Harvey lazy butterflies is used.

`PqNtt`

use vaea_ntt::pq::{PqScheme, PqNtt};

let ntt = PqNtt::new(PqScheme::MlDsa65);
ntt.forward(&mut data);
ntt.inverse(&mut data);
let product = ntt.multiply(&a, &b);

// Available presets:
// PqScheme::MlDsa44  — NIST Level 2 (q=8380417, N=256)
// PqScheme::MlDsa65  — NIST Level 3 (q=8380417, N=256)
// PqScheme::MlDsa87  — NIST Level 5 (q=8380417, N=256)

Utilities

use vaea_ntt::ntt32::{generate_primes_28, is_prime_32, find_primitive_root};

// Generate NTT-friendly primes < 2^28 for a given N
let primes = generate_primes_28(1024, 3); // 3 primes for N=1024

Features

Feature	Default	Description
`std`	✅	Enables `std::error::Error` impl on `NttError`
`rand`	—	Random polynomial generation (`Poly64::new_random()`, etc.)
`ffi`	—	Diplomat FFI bindings (C, C++, JS/WASM)

`no_std` Usage

[dependencies]
vaea-ntt = { version = "0.1", default-features = false }

Requires alloc. Zero runtime dependencies in this configuration.

Performance

Measured with Criterion on Apple M3 Pro (aarch64), --release, single-threaded.

Forward NTT (`ntt32`, q = 12 289)

N	Latency	Throughput
64	66 ns	970 M coeff/s
256	234 ns	1.09 G coeff/s
1 024	1.19 µs	860 M coeff/s
4 096	5.7 µs	719 M coeff/s
8 192	11.4 µs	719 M coeff/s
16 384	27.2 µs	602 M coeff/s
32 768	58.5 µs	560 M coeff/s

Inverse NTT (`ntt32`, q = 12 289)

N	Latency
256	320 ns
1 024	1.55 µs
4 096	7.7 µs
32 768	63.8 µs

Negacyclic Polynomial Multiplication

Two forward NTTs + pointwise multiply + inverse NTT.

N	Total
256	1.08 µs
1 024	4.97 µs
4 096	23.3 µs

Run cargo bench on your hardware for your own numbers. Results vary with hardware and system load. Disable CPU frequency scaling for reproducible measurements.

Architecture

src/
├── ntt32/           # 28-bit NTT pipeline
│   ├── arith.rs     # Branchless modular arithmetic (add, sub, mul, pow, inv)
│   ├── context.rs   # Ntt32Context — unified API with NEON/scalar dispatch
│   ├── neon.rs      # ARM NEON intrinsics (4-stage fused butterflies)
│   ├── scalar.rs    # Portable scalar (Shoup multiplication, Harvey butterfly)
│   └── prime.rs     # NTT-friendly prime generation, primitive root finding
├── ntt64/           # 64-bit NTT pipeline (Barrett + Montgomery)
│   ├── arith.rs     # 64-bit modular arithmetic
│   ├── context.rs   # Ntt64Context
│   └── prime.rs     # 64-bit prime utilities
├── pq.rs            # Post-quantum presets (ML-DSA)
├── poly.rs          # Poly64 — polynomial over Z_q[X]/(X^N+1)
├── rns.rs           # RNS/CRT multi-prime decomposition
├── ffi.rs           # Diplomat FFI bridge
└── lib.rs

Design Rationale

ARM NEON native: 4×u32 lanes. u32 × u32 products fit in u64, no widening to u128.
Lazy reduction: With q < 2²⁸, intermediates 3q < 2³⁰ fit in u32, enabling deferred Barrett reduction across multiple butterfly stages.
PQ aligned: All NIST lattice standards use primes ≤ 23 bits — well within 28 bits.

FHE schemes (CKKS, BGV) use 60–62 bit primes — these don't fit in u32.
ntt64 provides Barrett and Montgomery arithmetic for large primes.
RNS combines multiple ntt64 contexts for multi-precision FHE computation.

Security

Property	Guarantee
Constant-time	All arithmetic uses branchless SIMD masks (`vcgeq` + `vandq`), no data-dependent branches.
Input validation	`try_new()` rejects non-prime `q`, non-power-of-two `N`, and non-NTT-friendly primes.
Memory safety	All NEON accesses are bounds-checked via loop guards. `unsafe` limited to NEON intrinsics.
Thread safety	`Ntt32Context` is `Send + Sync`. Verified with 8 threads × 100 iterations.

See SECURITY.md for the vulnerability disclosure policy.

Testing

# Unit + integration + doc tests
cargo test --release

# Benchmarks
cargo bench --bench ntt32_bench      # NTT32 full scaling suite
cargo bench --bench ntt64_bench      # NTT64 pipeline
cargo bench --bench pq_bench         # Post-quantum presets

# Security & exhaustive validation
cargo run --release --example exhaustive_test          # 2618 test cases
cargo run --release --example verify_no_false_positive # anti-trivial-pass
cargo run --release --example security_exploits        # exploit suite

License

This project is dual-licensed:

Open Source — AGPL-3.0-or-later

Free for open-source projects. See LICENSE.

If you use VaeaNTT in a network service or distribute it, you must release your complete source code under the AGPL. This applies to modified and unmodified usage.

Commercial License

For closed-source, proprietary, or embedded use, a commercial license is available that removes all AGPL obligations.

Contact: alexis@vaea.tech

Table of Contents