<p align="center">
<h1 align="center">VaeaNTT</h1>
<p align="center">
<strong>High-performance Number Theoretic Transform engine for lattice-based cryptography</strong>
</p>
<p align="center">
<a href="https://crates.io/crates/vaea-ntt"><img src="https://img.shields.io/crates/v/vaea-ntt.svg" alt="crates.io"></a>
<a href="https://docs.rs/vaea-ntt"><img src="https://img.shields.io/docsrs/vaea-ntt" alt="docs.rs"></a>
<a href="https://github.com/Vaeatech/VaeaNTT/actions"><img src="https://img.shields.io/github/actions/workflow/status/Vaeatech/VaeaNTT/ci.yml?branch=main" alt="CI"></a>
<a href="LICENSE"><img src="https://img.shields.io/badge/license-AGPL--3.0--or--later-blue.svg" alt="License"></a>
<img src="https://img.shields.io/badge/MSRV-1.87-orange.svg" alt="MSRV">
<img src="https://img.shields.io/badge/no__std-compatible-green.svg" alt="no_std">
</p>
</p>
---
VaeaNTT is a Rust library providing NTT (Number Theoretic Transform) implementations
optimized for **ARM NEON** (aarch64), with a portable scalar fallback for all platforms.
- ๐ **ARM NEON native** โ all butterfly stages vectorized with 4-wide `u32` SIMD
- ๐ **Two pipelines** โ 28-bit primes (`ntt32`) and 60โ62 bit primes (`ntt64`)
- ๐ฆ **`no_std`** โ runs on bare-metal, requires only `alloc`
- ๐ **Constant-time** โ branchless arithmetic, no data-dependent branches
- ๐ฏ **Runtime-generic** โ any NTT-friendly prime, not hardcoded to one scheme
- ๐ **Multi-language** โ C, C++, JS/WASM bindings via Diplomat FFI
## Table of Contents
- [Quick Start](#quick-start)
- [Supported Parameters](#supported-parameters)
- [API Reference](#api-reference)
- [Performance](#performance)
- [Architecture](#architecture)
- [Security](#security)
- [Testing](#testing)
- [License](#license)
## Quick Start
Add to your `Cargo.toml`:
```toml
[dependencies]
vaea-ntt = "0.1"
```
### Basic NTT
```rust
use vaea_ntt::ntt32::Ntt32Context;
// Any NTT-friendly prime < 2^28
let ctx = Ntt32Context::new(256, 8_380_417); // ML-DSA prime
let mut data = vec![42u32; 256];
ctx.forward(&mut data); // Coefficient โ NTT domain
ctx.inverse(&mut data); // NTT domain โ Coefficient
### Post-Quantum Preset (ML-DSA)
```rust
use vaea_ntt::pq::{PqScheme, PqNtt};
let ntt = PqNtt::new(PqScheme::MlDsa65); // NIST Level 3
let mut poly = vec![0u32; 256];
poly[0] = 1;
ntt.forward(&mut poly);
ntt.inverse(&mut poly);
assert_eq!(poly[0], 1);
```
### Polynomial Multiplication
```rust
use vaea_ntt::ntt32::Ntt32Context;
let ctx = Ntt32Context::new(256, 8_380_417);
// (1 + x) ร (1 + x) = 1 + 2x + xยฒ in Z_q[X]/(X^256 + 1)
let mut a = vec![0u32; 256];
a[0] = 1; a[1] = 1;
let result = ctx.negacyclic_mul(&a, &a);
assert_eq!(&result[..3], &[1, 2, 1]);
```
## Supported Parameters
VaeaNTT accepts any prime `q` and power-of-two `N` satisfying `q โก 1 (mod 2N)`.
### `ntt32` โ Primes < 2ยฒโธ
| **ML-DSA** | 8 380 417 | 23 | 256 |
| **Falcon** | 12 289 | 14 | 512, 1024 |
| **NewHope** | 7 681 | 13 | 512, 1024 |
| **FHE** (CKKS/BGV CRT limbs) | any < 2ยฒโธ | โค 28 | up to 32 768 |
### `ntt64` โ Primes 60โ62 bits
For FHE-compatible 64-bit primes. Includes built-in constants for common primes
(`PRIME_SEAL`, `PRIME_60_1`, `PRIME_62_1`, etc.).
> **Note on ML-KEM**: ML-KEM uses q = 3329 with an incomplete NTT
> (size-128 over coefficient pairs), not a standard negacyclic NTT.
> VaeaNTT's standard NTT works with q = 3329 for N โค 128.
> A dedicated incomplete NTT module for ML-KEM is planned.
## API Reference
### Modules
| [`ntt32`](src/ntt32/) | NTT for primes < 2ยฒโธ. ARM NEON vectorized + scalar fallback. |
| [`ntt64`](src/ntt64/) | NTT for 60โ62 bit primes. Barrett and Montgomery arithmetic. |
| [`pq`](src/pq.rs) | Post-quantum presets for ML-DSA. |
| [`poly`](src/poly.rs) | Polynomial arithmetic over Z_q[X]/(X^N + 1), 64-bit coefficients. |
| [`rns`](src/rns.rs) | Residue Number System (multi-prime CRT) for FHE. |
| [`ffi`](src/ffi.rs) | FFI bindings via Diplomat (C, C++, JS/WASM). Requires `ffi` feature. |
### `Ntt32Context`
```rust
// Construction
let ctx = Ntt32Context::new(n, q); // panics on invalid params
let ctx = Ntt32Context::try_new(n, q)?; // returns Result<_, NttError>
// Forward / Inverse NTT (in-place)
ctx.forward(&mut data); // coefficient โ NTT domain
ctx.inverse(&mut data); // NTT โ coefficient (ร Nโปยน)
ctx.inverse_lazy(&mut data); // NTT โ coefficient (no Nโปยน)
// Polynomial multiplication in Z_q[X]/(X^N + 1)
let result = ctx.negacyclic_mul(&a, &b); // allocating
ctx.negacyclic_mul_into(&mut a, &mut b, &mut result); // zero-allocation
```
On `aarch64`, `forward`/`inverse` dispatch to NEON automatically.
On other architectures, a scalar fallback using Shoup multiplication and Harvey lazy butterflies is used.
### `PqNtt`
```rust
use vaea_ntt::pq::{PqScheme, PqNtt};
let ntt = PqNtt::new(PqScheme::MlDsa65);
ntt.forward(&mut data);
ntt.inverse(&mut data);
let product = ntt.multiply(&a, &b);
// Available presets:
// PqScheme::MlDsa44 โ NIST Level 2 (q=8380417, N=256)
// PqScheme::MlDsa65 โ NIST Level 3 (q=8380417, N=256)
// PqScheme::MlDsa87 โ NIST Level 5 (q=8380417, N=256)
```
### Utilities
```rust
use vaea_ntt::ntt32::{generate_primes_28, is_prime_32, find_primitive_root};
// Generate NTT-friendly primes < 2^28 for a given N
let primes = generate_primes_28(1024, 3); // 3 primes for N=1024
```
### Features
| `std` | โ
| Enables `std::error::Error` impl on `NttError` |
| `rand` | โ | Random polynomial generation (`Poly64::new_random()`, etc.) |
| `ffi` | โ | Diplomat FFI bindings (C, C++, JS/WASM) |
#### `no_std` Usage
```toml
[dependencies]
vaea-ntt = { version = "0.1", default-features = false }
```
Requires `alloc`. Zero runtime dependencies in this configuration.
## Performance
Measured with [Criterion](https://crates.io/crates/criterion) on **Apple M3 Pro** (aarch64), `--release`, single-threaded.
### Forward NTT (`ntt32`, q = 12 289)
| 64 | **66 ns** | 970 M coeff/s |
| 256 | **234 ns** | 1.09 G coeff/s |
| 1 024 | **1.19 ยตs** | 860 M coeff/s |
| 4 096 | **5.7 ยตs** | 719 M coeff/s |
| 8 192 | **11.4 ยตs** | 719 M coeff/s |
| 16 384 | **27.2 ยตs** | 602 M coeff/s |
| 32 768 | **58.5 ยตs** | 560 M coeff/s |
### Inverse NTT (`ntt32`, q = 12 289)
| 256 | **320 ns** |
| 1 024 | **1.55 ยตs** |
| 4 096 | **7.7 ยตs** |
| 32 768 | **63.8 ยตs** |
### Negacyclic Polynomial Multiplication
Two forward NTTs + pointwise multiply + inverse NTT.
| 256 | **1.08 ยตs** |
| 1 024 | **4.97 ยตs** |
| 4 096 | **23.3 ยตs** |
> **Run `cargo bench` on your hardware for your own numbers.**
> Results vary with hardware and system load.
> Disable CPU frequency scaling for reproducible measurements.
## Architecture
```
src/
โโโ ntt32/ # 28-bit NTT pipeline
โ โโโ arith.rs # Branchless modular arithmetic (add, sub, mul, pow, inv)
โ โโโ context.rs # Ntt32Context โ unified API with NEON/scalar dispatch
โ โโโ neon.rs # ARM NEON intrinsics (4-stage fused butterflies)
โ โโโ scalar.rs # Portable scalar (Shoup multiplication, Harvey butterfly)
โ โโโ prime.rs # NTT-friendly prime generation, primitive root finding
โโโ ntt64/ # 64-bit NTT pipeline (Barrett + Montgomery)
โ โโโ arith.rs # 64-bit modular arithmetic
โ โโโ context.rs # Ntt64Context
โ โโโ prime.rs # 64-bit prime utilities
โโโ pq.rs # Post-quantum presets (ML-DSA)
โโโ poly.rs # Poly64 โ polynomial over Z_q[X]/(X^N+1)
โโโ rns.rs # RNS/CRT multi-prime decomposition
โโโ ffi.rs # Diplomat FFI bridge
โโโ lib.rs
```
### Design Rationale
<details>
<summary><strong>Why 28-bit primes for <code>ntt32</code>?</strong></summary>
- **ARM NEON native**: 4ร`u32` lanes. `u32 ร u32` products fit in `u64`, no widening to `u128`.
- **Lazy reduction**: With `q < 2ยฒโธ`, intermediates `3q < 2ยณโฐ` fit in `u32`, enabling deferred Barrett reduction across multiple butterfly stages.
- **PQ aligned**: All NIST lattice standards use primes โค 23 bits โ well within 28 bits.
</details>
<details>
<summary><strong>Why two separate pipelines?</strong></summary>
- FHE schemes (CKKS, BGV) use 60โ62 bit primes โ these don't fit in `u32`.
- `ntt64` provides Barrett and Montgomery arithmetic for large primes.
- RNS combines multiple `ntt64` contexts for multi-precision FHE computation.
</details>
## Security
| **Constant-time** | All arithmetic uses branchless SIMD masks (`vcgeq` + `vandq`), no data-dependent branches. |
| **Input validation** | `try_new()` rejects non-prime `q`, non-power-of-two `N`, and non-NTT-friendly primes. |
| **Memory safety** | All NEON accesses are bounds-checked via loop guards. `unsafe` limited to NEON intrinsics. |
| **Thread safety** | `Ntt32Context` is `Send + Sync`. Verified with 8 threads ร 100 iterations. |
See [SECURITY.md](SECURITY.md) for the vulnerability disclosure policy.
## Testing
```bash
# Unit + integration + doc tests
cargo test --release
# Benchmarks
cargo bench --bench ntt32_bench # NTT32 full scaling suite
cargo bench --bench ntt64_bench # NTT64 pipeline
cargo bench --bench pq_bench # Post-quantum presets
# Security & exhaustive validation
cargo run --release --example exhaustive_test # 2618 test cases
cargo run --release --example verify_no_false_positive # anti-trivial-pass
cargo run --release --example security_exploits # exploit suite
```
## License
This project is **dual-licensed**:
### Open Source โ AGPL-3.0-or-later
Free for open-source projects. See [LICENSE](LICENSE).
If you use VaeaNTT in a network service or distribute it, you must release your
complete source code under the AGPL. This applies to modified and unmodified usage.
### Commercial License
For closed-source, proprietary, or embedded use, a commercial license is available
that removes all AGPL obligations.
**Contact**: [alexis@vaea.tech](mailto:alexis@vaea.tech)