vaea-ntt 0.1.1

High-performance Number Theoretic Transform (NTT) for post-quantum cryptography. ARM NEON SIMD native, constant-time, no_std. ML-DSA, Falcon, FHE. Dual-licensed AGPL-3.0 + commercial.

VaeaNTT is a Rust library providing NTT (Number Theoretic Transform) implementations optimized for ARM NEON (aarch64), with a portable scalar fallback for all platforms.

  • 🚀 ARM NEON native: all butterfly stages vectorized with 4-wide u32 SIMD
  • 🔀 Two pipelines: 28-bit primes (ntt32) and 60–62 bit primes (ntt64)
  • 📦 no_std: runs on bare metal, requires only alloc
  • 🔒 Constant-time: branchless arithmetic, no data-dependent branches
  • 🎯 Runtime-generic: any NTT-friendly prime, not hardcoded to one scheme
  • 🌐 Multi-language: C, C++, JS/WASM bindings via Diplomat FFI


Quick Start

Add to your Cargo.toml:

[dependencies]
vaea-ntt = "0.1"

Basic NTT

use vaea_ntt::ntt32::Ntt32Context;

// Any NTT-friendly prime < 2^28
let ctx = Ntt32Context::new(256, 8_380_417); // ML-DSA prime

let mut data = vec![42u32; 256];
ctx.forward(&mut data);   // Coefficient → NTT domain
ctx.inverse(&mut data);   // NTT domain → Coefficient
assert!(data.iter().all(|&x| x == 42));

Post-Quantum Preset (ML-DSA)

use vaea_ntt::pq::{PqScheme, PqNtt};

let ntt = PqNtt::new(PqScheme::MlDsa65); // NIST Level 3
let mut poly = vec![0u32; 256];
poly[0] = 1;
ntt.forward(&mut poly);
ntt.inverse(&mut poly);
assert_eq!(poly[0], 1);

Polynomial Multiplication

use vaea_ntt::ntt32::Ntt32Context;

let ctx = Ntt32Context::new(256, 8_380_417);

// (1 + x) × (1 + x) = 1 + 2x + x²  in Z_q[X]/(X^256 + 1)
let mut a = vec![0u32; 256];
a[0] = 1; a[1] = 1;
let result = ctx.negacyclic_mul(&a, &a);
assert_eq!(&result[..3], &[1, 2, 1]);
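
To make the reduction modulo X^N + 1 concrete, here is a minimal schoolbook reference, written independently of the crate. The function name negacyclic_mul_schoolbook is hypothetical and is not part of the VaeaNTT API. It runs in O(N²) and is only meant as a sketch for cross-checking results on small inputs.

```rust
/// Reference O(N^2) multiplication in Z_q[X]/(X^N + 1).
/// Hypothetical helper for illustration; not part of the vaea-ntt API.
fn negacyclic_mul_schoolbook(a: &[u32], b: &[u32], q: u32) -> Vec<u32> {
    assert_eq!(a.len(), b.len());
    let n = a.len();
    let q64 = q as u64;
    let mut out = vec![0u64; n];
    for i in 0..n {
        for j in 0..n {
            let prod = (a[i] as u64 % q64) * (b[j] as u64 % q64) % q64;
            let k = i + j;
            if k < n {
                out[k] = (out[k] + prod) % q64;
            } else {
                // X^N = -1, so the term wraps around with a sign flip.
                out[k - n] = (out[k - n] + q64 - prod) % q64;
            }
        }
    }
    out.into_iter().map(|c| c as u32).collect()
}
```

With a = 1 + x this yields the same leading coefficients as the example above. Multiplying x by x^(N-1) yields q - 1 in position 0, which is the sign flip that distinguishes negacyclic from cyclic convolution.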

Supported Parameters

VaeaNTT accepts any prime q and power-of-two N satisfying q ≡ 1 (mod 2N).
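
The condition can be checked by hand before constructing a context. The sketch below is a stand-alone illustration using trial division. The names is_prime and is_ntt_friendly are assumptions made for this example and do not describe how the crate validates its inputs.

```rust
/// Trial-division primality test; adequate for moduli below 2^32.
fn is_prime(q: u64) -> bool {
    if q < 2 {
        return false;
    }
    if q % 2 == 0 {
        return q == 2;
    }
    let mut d = 3u64;
    while d * d <= q {
        if q % d == 0 {
            return false;
        }
        d += 2;
    }
    true
}

/// True when N is a power of two, q is prime, and q ≡ 1 (mod 2N).
/// Hypothetical helper for illustration; not part of the vaea-ntt API.
fn is_ntt_friendly(n: u64, q: u64) -> bool {
    n.is_power_of_two() && is_prime(q) && (q - 1) % (2 * n) == 0
}
```

For example, q = 8 380 417 passes for N = 256 because q - 1 = 512 × 16 368, while q = 3329 passes for N = 128 but not for N = 256.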

ntt32: Primes < 2²⁸

| Use Case | q | Bits | Tested N |
| --- | --- | --- | --- |
| ML-DSA | 8 380 417 | 23 | 256 |
| Falcon | 12 289 | 14 | 512, 1024 |
| NewHope | 7 681 | 13 | 512, 1024 |
| FHE (CKKS/BGV CRT limbs) | any < 2²⁸ | ≤ 28 | up to 32 768 |

ntt64: Primes 60–62 bits

For FHE-compatible 64-bit primes. Includes built-in constants for common primes (PRIME_SEAL, PRIME_60_1, PRIME_62_1, etc.).

Note on ML-KEM: ML-KEM uses q = 3329 with an incomplete NTT (size-128 over coefficient pairs), not a standard negacyclic NTT. VaeaNTT's standard NTT works with q = 3329 for N ≤ 128. A dedicated incomplete NTT module for ML-KEM is planned.

API Reference

Modules

| Module | Description |
| --- | --- |
| ntt32 | NTT for primes < 2²⁸. ARM NEON vectorized + scalar fallback. |
| ntt64 | NTT for 60–62 bit primes. Barrett and Montgomery arithmetic. |
| pq | Post-quantum presets for ML-DSA. |
| poly | Polynomial arithmetic over Z_q[X]/(X^N + 1), 64-bit coefficients. |
| rns | Residue Number System (multi-prime CRT) for FHE. |
| ffi | FFI bindings via Diplomat (C, C++, JS/WASM). Requires ffi feature. |

Ntt32Context

// Construction
let ctx = Ntt32Context::new(n, q);           // panics on invalid params
let ctx = Ntt32Context::try_new(n, q)?;      // returns Result<_, NttError>

// Forward / Inverse NTT (in-place)
ctx.forward(&mut data);                       // coefficient → NTT domain
ctx.inverse(&mut data);                       // NTT → coefficient (× N⁻¹)
ctx.inverse_lazy(&mut data);                  // NTT → coefficient (no N⁻¹)

// Polynomial multiplication in Z_q[X]/(X^N + 1)
let result = ctx.negacyclic_mul(&a, &b);     // allocating
ctx.negacyclic_mul_into(&mut a, &mut b, &mut result); // zero-allocation

On aarch64, forward/inverse dispatch to NEON automatically. On other architectures, a scalar fallback using Shoup multiplication and Harvey lazy butterflies is used.
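
For readers unfamiliar with Shoup multiplication, the sketch below shows the general technique for a fixed multiplier w < q with q < 2²⁸. It is written from the textbook description, not taken from the crate's scalar.rs, and the names shoup_precompute and mul_shoup are hypothetical.

```rust
/// Precompute w' = floor(w * 2^32 / q) for a fixed multiplier w < q.
/// Hypothetical helper for illustration; not part of the vaea-ntt API.
fn shoup_precompute(w: u32, q: u32) -> u32 {
    (((w as u64) << 32) / q as u64) as u32
}

/// Compute a * w mod q using the precomputed w'.
/// Assumes w < q and q < 2^28, so the lazy result in [0, 2q) fits in u32.
fn mul_shoup(a: u32, w: u32, w_shoup: u32, q: u32) -> u32 {
    // Estimate of the quotient floor(a * w / q); off by at most one.
    let hi = ((a as u64 * w_shoup as u64) >> 32) as u32;
    // Low 32 bits are enough because the true remainder is below 2q.
    let r = a.wrapping_mul(w).wrapping_sub(hi.wrapping_mul(q));
    // One conditional subtraction brings the result into [0, q).
    if r >= q { r - q } else { r }
}
```

The point of the precomputation is that the per-element work is two multiplications and a subtraction, with no division. Twiddle factors are fixed per context, so their w' values can be computed once.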

PqNtt

use vaea_ntt::pq::{PqScheme, PqNtt};

let ntt = PqNtt::new(PqScheme::MlDsa65);
ntt.forward(&mut data);
ntt.inverse(&mut data);
let product = ntt.multiply(&a, &b);

// Available presets:
// PqScheme::MlDsa44: NIST Level 2 (q=8380417, N=256)
// PqScheme::MlDsa65: NIST Level 3 (q=8380417, N=256)
// PqScheme::MlDsa87: NIST Level 5 (q=8380417, N=256)

Utilities

use vaea_ntt::ntt32::{generate_primes_28, is_prime_32, find_primitive_root};

// Generate NTT-friendly primes < 2^28 for a given N
let primes = generate_primes_28(1024, 3); // 3 primes for N=1024
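
As background for the root-finding utility, the following stand-alone sketch shows one common way to find a primitive 2N-th root of unity modulo a prime q with q ≡ 1 (mod 2N). The names pow_mod and find_primitive_2n_root are hypothetical. The signature and algorithm of the crate's find_primitive_root are not documented here and may differ.

```rust
/// Square-and-multiply modular exponentiation (q < 2^32 so products fit in u64).
fn pow_mod(mut base: u64, mut exp: u64, q: u64) -> u64 {
    let mut acc = 1u64;
    base %= q;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % q;
        }
        base = base * base % q;
        exp >>= 1;
    }
    acc
}

/// Find psi with psi^N = -1 (mod q), i.e. a primitive 2N-th root of unity.
/// Assumes q is prime. Hypothetical helper; not part of the vaea-ntt API.
fn find_primitive_2n_root(n: u64, q: u64) -> Option<u64> {
    if !n.is_power_of_two() || q < 3 || (q - 1) % (2 * n) != 0 {
        return None;
    }
    let cofactor = (q - 1) / (2 * n);
    // g^cofactor has order dividing 2N; since 2N is a power of two,
    // psi^N == -1 forces the order to be exactly 2N.
    (2..q)
        .map(|g| pow_mod(g, cofactor, q))
        .find(|&psi| pow_mod(psi, n, q) == q - 1)
}
```

The returned psi is the value whose odd powers are the roots of X^N + 1, which is what a negacyclic NTT evaluates at.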

Features

| Feature | Default | Description |
| --- | --- | --- |
| std | yes | Enables std::error::Error impl on NttError |
| rand | no | Random polynomial generation (Poly64::new_random(), etc.) |
| ffi | no | Diplomat FFI bindings (C, C++, JS/WASM) |

no_std Usage

[dependencies]
vaea-ntt = { version = "0.1", default-features = false }

Requires alloc. Zero runtime dependencies in this configuration.

Performance

Measured with Criterion on Apple M3 Pro (aarch64), --release, single-threaded.

Forward NTT (ntt32, q = 12 289)

| N | Latency | Throughput |
| --- | --- | --- |
| 64 | 66 ns | 970 M coeff/s |
| 256 | 234 ns | 1.09 G coeff/s |
| 1 024 | 1.19 µs | 860 M coeff/s |
| 4 096 | 5.7 µs | 719 M coeff/s |
| 8 192 | 11.4 µs | 719 M coeff/s |
| 16 384 | 27.2 µs | 602 M coeff/s |
| 32 768 | 58.5 µs | 560 M coeff/s |

Inverse NTT (ntt32, q = 12 289)

| N | Latency |
| --- | --- |
| 256 | 320 ns |
| 1 024 | 1.55 µs |
| 4 096 | 7.7 µs |
| 32 768 | 63.8 µs |

Negacyclic Polynomial Multiplication

Two forward NTTs + pointwise multiply + inverse NTT.
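
The structure of that pipeline can be illustrated with a deliberately naive O(N²) transform that evaluates each polynomial at the odd powers of a primitive 2N-th root psi. This is a sketch for understanding only: the function names are hypothetical, it assumes coefficients already reduced below q, and it bears no relation to the optimized butterflies that the benchmarks measure.

```rust
/// Square-and-multiply modular exponentiation.
fn pow_mod(mut base: u64, mut exp: u64, q: u64) -> u64 {
    let mut acc = 1u64;
    base %= q;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % q;
        }
        base = base * base % q;
        exp >>= 1;
    }
    acc
}

/// Forward transform: evaluate `a` at psi^(2i+1), the roots of X^N + 1.
fn ntt_naive(a: &[u64], psi: u64, q: u64) -> Vec<u64> {
    let n = a.len() as u64;
    (0..n)
        .map(|i| {
            let root = pow_mod(psi, 2 * i + 1, q);
            // Horner evaluation, highest coefficient first.
            a.iter().rev().fold(0u64, |acc, &c| (acc * root + c) % q)
        })
        .collect()
}

/// Inverse transform, including the N^-1 scaling.
fn intt_naive(a_hat: &[u64], psi: u64, q: u64) -> Vec<u64> {
    let n = a_hat.len() as u64;
    let psi_inv = pow_mod(psi, q - 2, q); // Fermat inverse, q prime
    let n_inv = pow_mod(n, q - 2, q);
    (0..n)
        .map(|j| {
            let sum = (0..n).fold(0u64, |acc, i| {
                (acc + a_hat[i as usize] * pow_mod(psi_inv, (2 * i + 1) * j, q)) % q
            });
            sum * n_inv % q
        })
        .collect()
}

/// Two forward transforms, pointwise multiply, one inverse transform.
fn negacyclic_mul_naive_ntt(a: &[u64], b: &[u64], psi: u64, q: u64) -> Vec<u64> {
    let (fa, fb) = (ntt_naive(a, psi, q), ntt_naive(b, psi, q));
    let prod: Vec<u64> = fa.iter().zip(&fb).map(|(x, y)| x * y % q).collect();
    intt_naive(&prod, psi, q)
}
```

A toy parameter set such as N = 8, q = 17, psi = 3 is enough to try it: 3 has order 16 modulo 17, so 3⁸ ≡ -1.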

N Total
256 1.08 ยตs
1 024 4.97 ยตs
4 096 23.3 ยตs

Run cargo bench on your hardware for your own numbers. Results vary with hardware and system load. Disable CPU frequency scaling for reproducible measurements.

Architecture

src/
├── ntt32/           # 28-bit NTT pipeline
│   ├── arith.rs     # Branchless modular arithmetic (add, sub, mul, pow, inv)
│   ├── context.rs   # Ntt32Context: unified API with NEON/scalar dispatch
│   ├── neon.rs      # ARM NEON intrinsics (4-stage fused butterflies)
│   ├── scalar.rs    # Portable scalar (Shoup multiplication, Harvey butterfly)
│   └── prime.rs     # NTT-friendly prime generation, primitive root finding
├── ntt64/           # 64-bit NTT pipeline (Barrett + Montgomery)
│   ├── arith.rs     # 64-bit modular arithmetic
│   ├── context.rs   # Ntt64Context
│   └── prime.rs     # 64-bit prime utilities
├── pq.rs            # Post-quantum presets (ML-DSA)
├── poly.rs          # Poly64: polynomial over Z_q[X]/(X^N+1)
├── rns.rs           # RNS/CRT multi-prime decomposition
├── ffi.rs           # Diplomat FFI bridge
└── lib.rs

Design Rationale

  • ARM NEON native: 4×u32 lanes. u32 × u32 products fit in u64, so no widening to u128 is needed.
  • Lazy reduction: With q < 2²⁸, intermediates 3q < 2³⁰ fit in u32, enabling deferred Barrett reduction across multiple butterfly stages.
  • PQ aligned: All NIST lattice standards use primes ≤ 23 bits, well within 28 bits.
  • FHE schemes (CKKS, BGV) use 60–62 bit primes, which do not fit in u32.
  • ntt64 provides Barrett and Montgomery arithmetic for large primes.
  • RNS combines multiple ntt64 contexts for multi-precision FHE computation.
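
The lazy-reduction idea can be sketched with a Harvey-style butterfly in plain scalar Rust. This is an illustration of the general technique, not the crate's code: the function names are hypothetical, and the sketch keeps values in [0, 4q), which for q < 2²⁸ stays below 2³⁰ and therefore inside u32. The crate's exact bounds may differ.

```rust
/// Precompute w' = floor(w * 2^32 / q) for a twiddle factor w < q.
fn shoup_precompute(w: u32, q: u32) -> u32 {
    (((w as u64) << 32) / q as u64) as u32
}

/// Lazy Shoup product: congruent to a * w mod q, left unreduced in [0, 2q).
fn mul_shoup_lazy(a: u32, w: u32, w_shoup: u32, q: u32) -> u32 {
    let hi = ((a as u64 * w_shoup as u64) >> 32) as u32;
    a.wrapping_mul(w).wrapping_sub(hi.wrapping_mul(q))
}

/// Lazy butterfly: inputs in [0, 4q), outputs in [0, 4q).
/// Assumes q < 2^28. Hypothetical helper; not part of the vaea-ntt API.
fn butterfly_lazy(x: u32, y: u32, w: u32, w_shoup: u32, q: u32) -> (u32, u32) {
    let two_q = 2 * q;
    // Bring x back into [0, 2q) with a mask instead of a branch.
    let mask = 0u32.wrapping_sub((x >= two_q) as u32);
    let x = x - (two_q & mask);
    let t = mul_shoup_lazy(y, w, w_shoup, q); // in [0, 2q)
    // x + t < 4q and x + 2q - t < 4q, so no further reduction is needed here.
    (x + t, x + two_q - t)
}
```

Outputs are only congruent to the exact results modulo q. A final pass that reduces every coefficient into [0, q) is deferred until after the last stage, which is where the savings come from.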

Security

| Property | Guarantee |
| --- | --- |
| Constant-time | All arithmetic uses branchless SIMD masks (vcgeq + vandq), no data-dependent branches. |
| Input validation | try_new() rejects non-prime q, non-power-of-two N, and non-NTT-friendly primes. |
| Memory safety | All NEON accesses are bounds-checked via loop guards. unsafe limited to NEON intrinsics. |
| Thread safety | Ntt32Context is Send + Sync. Verified with 8 threads × 100 iterations. |
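
The compare-and-mask pattern named in the constant-time row has a simple scalar analogue. The sketch below is an assumption-laden illustration: the ct_* names are hypothetical, inputs are assumed to be below 2³¹, and source-level branchlessness does not by itself guarantee that a given compiler emits branch-free machine code.

```rust
/// Reduce x from [0, 2q) into [0, q) without a data-dependent branch.
/// Assumes x and q are below 2^31. Hypothetical helper; not the crate's API.
fn ct_reduce_once(x: u32, q: u32) -> u32 {
    // The top bit of x - q is set exactly when x < q.
    let borrow = x.wrapping_sub(q) >> 31; // 1 if x < q, else 0
    let mask = borrow.wrapping_sub(1); // 0 if x < q, all ones if x >= q
    x - (q & mask)
}

/// Modular addition for a, b in [0, q).
fn ct_add(a: u32, b: u32, q: u32) -> u32 {
    ct_reduce_once(a + b, q)
}

/// Modular subtraction for a, b in [0, q).
fn ct_sub(a: u32, b: u32, q: u32) -> u32 {
    let d = a.wrapping_sub(b);
    let mask = 0u32.wrapping_sub(d >> 31); // all ones if a < b
    d.wrapping_add(q & mask)
}
```

The mask plays the role of the comparison result, and the AND selects either q or 0 as the correction term, so both outcomes execute the same instruction sequence.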

See SECURITY.md for the vulnerability disclosure policy.

Testing

# Unit + integration + doc tests
cargo test --release

# Benchmarks
cargo bench --bench ntt32_bench      # NTT32 full scaling suite
cargo bench --bench ntt64_bench      # NTT64 pipeline
cargo bench --bench pq_bench         # Post-quantum presets

# Security & exhaustive validation
cargo run --release --example exhaustive_test          # 2618 test cases
cargo run --release --example verify_no_false_positive # anti-trivial-pass
cargo run --release --example security_exploits        # exploit suite

License

This project is dual-licensed:

Open Source: AGPL-3.0-or-later

Free for open-source projects. See LICENSE.

If you use VaeaNTT in a network service or distribute it, you must release your complete source code under the AGPL. This applies to modified and unmodified usage.

Commercial License

For closed-source, proprietary, or embedded use, a commercial license is available that removes all AGPL obligations.

Contact: alexis@vaea.tech