vaea-ntt 0.1.1

High-performance Number Theoretic Transform (NTT) for post-quantum cryptography. ARM NEON SIMD native, constant-time, no_std. ML-DSA, Falcon, FHE. Dual-licensed AGPL-3.0 + commercial.
Documentation
<p align="center">
  <h1 align="center">VaeaNTT</h1>
  <p align="center">
    <strong>High-performance Number Theoretic Transform engine for lattice-based cryptography</strong>
  </p>
  <p align="center">
    <a href="https://crates.io/crates/vaea-ntt"><img src="https://img.shields.io/crates/v/vaea-ntt.svg" alt="crates.io"></a>
    <a href="https://docs.rs/vaea-ntt"><img src="https://img.shields.io/docsrs/vaea-ntt" alt="docs.rs"></a>
    <a href="https://github.com/Vaeatech/VaeaNTT/actions"><img src="https://img.shields.io/github/actions/workflow/status/Vaeatech/VaeaNTT/ci.yml?branch=main" alt="CI"></a>
    <a href="LICENSE"><img src="https://img.shields.io/badge/license-AGPL--3.0--or--later-blue.svg" alt="License"></a>
    <img src="https://img.shields.io/badge/MSRV-1.87-orange.svg" alt="MSRV">
    <img src="https://img.shields.io/badge/no__std-compatible-green.svg" alt="no_std">
  </p>
</p>

---

VaeaNTT is a Rust library providing NTT (Number Theoretic Transform) implementations
optimized for **ARM NEON** (aarch64), with a portable scalar fallback for all platforms.

- ๐Ÿš€ **ARM NEON native** โ€” all butterfly stages vectorized with 4-wide `u32` SIMD
- ๐Ÿ”€ **Two pipelines** โ€” 28-bit primes (`ntt32`) and 60โ€“62 bit primes (`ntt64`)
- ๐Ÿ“ฆ **`no_std`** โ€” runs on bare-metal, requires only `alloc`
- ๐Ÿ”’ **Constant-time** โ€” branchless arithmetic, no data-dependent branches
- ๐ŸŽฏ **Runtime-generic** โ€” any NTT-friendly prime, not hardcoded to one scheme
- ๐ŸŒ **Multi-language** โ€” C, C++, JS/WASM bindings via Diplomat FFI

## Table of Contents

- [Quick Start]#quick-start
- [Supported Parameters]#supported-parameters
- [API Reference]#api-reference
- [Performance]#performance
- [Architecture]#architecture
- [Security]#security
- [Testing]#testing
- [License]#license

## Quick Start

Add to your `Cargo.toml`:

```toml
[dependencies]
vaea-ntt = "0.1"
```

### Basic NTT

```rust
use vaea_ntt::ntt32::Ntt32Context;

// Any NTT-friendly prime < 2^28
let ctx = Ntt32Context::new(256, 8_380_417); // ML-DSA prime

let mut data = vec![42u32; 256];
ctx.forward(&mut data);   // Coefficient โ†’ NTT domain
ctx.inverse(&mut data);   // NTT domain โ†’ Coefficient
assert!(data.iter().all(|&x| x == 42));
```

### Post-Quantum Preset (ML-DSA)

```rust
use vaea_ntt::pq::{PqScheme, PqNtt};

let ntt = PqNtt::new(PqScheme::MlDsa65); // NIST Level 3
let mut poly = vec![0u32; 256];
poly[0] = 1;
ntt.forward(&mut poly);
ntt.inverse(&mut poly);
assert_eq!(poly[0], 1);
```

### Polynomial Multiplication

```rust
use vaea_ntt::ntt32::Ntt32Context;

let ctx = Ntt32Context::new(256, 8_380_417);

// (1 + x) ร— (1 + x) = 1 + 2x + xยฒ  in Z_q[X]/(X^256 + 1)
let mut a = vec![0u32; 256];
a[0] = 1; a[1] = 1;
let result = ctx.negacyclic_mul(&a, &a);
assert_eq!(&result[..3], &[1, 2, 1]);
```

## Supported Parameters

VaeaNTT accepts any prime `q` and power-of-two `N` satisfying `q โ‰ก 1 (mod 2N)`.

### `ntt32` โ€” Primes < 2ยฒโธ

| Use Case | q | Bits | Tested N |
|:---------|--:|:----:|:---------|
| **ML-DSA** | 8 380 417 | 23 | 256 |
| **Falcon** | 12 289 | 14 | 512, 1024 |
| **NewHope** | 7 681 | 13 | 512, 1024 |
| **FHE** (CKKS/BGV CRT limbs) | any < 2ยฒโธ | โ‰ค 28 | up to 32 768 |

### `ntt64` โ€” Primes 60โ€“62 bits

For FHE-compatible 64-bit primes. Includes built-in constants for common primes
(`PRIME_SEAL`, `PRIME_60_1`, `PRIME_62_1`, etc.).

> **Note on ML-KEM**: ML-KEM uses q = 3329 with an incomplete NTT
> (size-128 over coefficient pairs), not a standard negacyclic NTT.
> VaeaNTT's standard NTT works with q = 3329 for N โ‰ค 128.
> A dedicated incomplete NTT module for ML-KEM is planned.

## API Reference

### Modules

| Module | Description |
|:-------|:------------|
| [`ntt32`]src/ntt32/ | NTT for primes < 2ยฒโธ. ARM NEON vectorized + scalar fallback. |
| [`ntt64`]src/ntt64/ | NTT for 60โ€“62 bit primes. Barrett and Montgomery arithmetic. |
| [`pq`]src/pq.rs | Post-quantum presets for ML-DSA. |
| [`poly`]src/poly.rs | Polynomial arithmetic over Z_q[X]/(X^N + 1), 64-bit coefficients. |
| [`rns`]src/rns.rs | Residue Number System (multi-prime CRT) for FHE. |
| [`ffi`]src/ffi.rs | FFI bindings via Diplomat (C, C++, JS/WASM). Requires `ffi` feature. |

### `Ntt32Context`

```rust
// Construction
let ctx = Ntt32Context::new(n, q);           // panics on invalid params
let ctx = Ntt32Context::try_new(n, q)?;      // returns Result<_, NttError>

// Forward / Inverse NTT (in-place)
ctx.forward(&mut data);                       // coefficient โ†’ NTT domain
ctx.inverse(&mut data);                       // NTT โ†’ coefficient (ร— Nโปยน)
ctx.inverse_lazy(&mut data);                  // NTT โ†’ coefficient (no Nโปยน)

// Polynomial multiplication in Z_q[X]/(X^N + 1)
let result = ctx.negacyclic_mul(&a, &b);     // allocating
ctx.negacyclic_mul_into(&mut a, &mut b, &mut result); // zero-allocation
```

On `aarch64`, `forward`/`inverse` dispatch to NEON automatically.
On other architectures, a scalar fallback using Shoup multiplication and Harvey lazy butterflies is used.

### `PqNtt`

```rust
use vaea_ntt::pq::{PqScheme, PqNtt};

let ntt = PqNtt::new(PqScheme::MlDsa65);
ntt.forward(&mut data);
ntt.inverse(&mut data);
let product = ntt.multiply(&a, &b);

// Available presets:
// PqScheme::MlDsa44  โ€” NIST Level 2 (q=8380417, N=256)
// PqScheme::MlDsa65  โ€” NIST Level 3 (q=8380417, N=256)
// PqScheme::MlDsa87  โ€” NIST Level 5 (q=8380417, N=256)
```

### Utilities

```rust
use vaea_ntt::ntt32::{generate_primes_28, is_prime_32, find_primitive_root};

// Generate NTT-friendly primes < 2^28 for a given N
let primes = generate_primes_28(1024, 3); // 3 primes for N=1024
```

### Features

| Feature | Default | Description |
|:--------|:-------:|:------------|
| `std` | โœ… | Enables `std::error::Error` impl on `NttError` |
| `rand` | โ€” | Random polynomial generation (`Poly64::new_random()`, etc.) |
| `ffi` | โ€” | Diplomat FFI bindings (C, C++, JS/WASM) |

#### `no_std` Usage

```toml
[dependencies]
vaea-ntt = { version = "0.1", default-features = false }
```

Requires `alloc`. Zero runtime dependencies in this configuration.

## Performance

Measured with [Criterion](https://crates.io/crates/criterion) on **Apple M3 Pro** (aarch64), `--release`, single-threaded.

### Forward NTT (`ntt32`, q = 12 289)

| N | Latency | Throughput |
|-----:|--------:|----------:|
| 64 | **66 ns** | 970 M coeff/s |
| 256 | **234 ns** | 1.09 G coeff/s |
| 1 024 | **1.19 ยตs** | 860 M coeff/s |
| 4 096 | **5.7 ยตs** | 719 M coeff/s |
| 8 192 | **11.4 ยตs** | 719 M coeff/s |
| 16 384 | **27.2 ยตs** | 602 M coeff/s |
| 32 768 | **58.5 ยตs** | 560 M coeff/s |

### Inverse NTT (`ntt32`, q = 12 289)

| N | Latency |
|-----:|--------:|
| 256 | **320 ns** |
| 1 024 | **1.55 ยตs** |
| 4 096 | **7.7 ยตs** |
| 32 768 | **63.8 ยตs** |

### Negacyclic Polynomial Multiplication

Two forward NTTs + pointwise multiply + inverse NTT.

| N | Total |
|-----:|------:|
| 256 | **1.08 ยตs** |
| 1 024 | **4.97 ยตs** |
| 4 096 | **23.3 ยตs** |

> **Run `cargo bench` on your hardware for your own numbers.**
> Results vary with hardware and system load.
> Disable CPU frequency scaling for reproducible measurements.

## Architecture

```
src/
โ”œโ”€โ”€ ntt32/           # 28-bit NTT pipeline
โ”‚   โ”œโ”€โ”€ arith.rs     # Branchless modular arithmetic (add, sub, mul, pow, inv)
โ”‚   โ”œโ”€โ”€ context.rs   # Ntt32Context โ€” unified API with NEON/scalar dispatch
โ”‚   โ”œโ”€โ”€ neon.rs      # ARM NEON intrinsics (4-stage fused butterflies)
โ”‚   โ”œโ”€โ”€ scalar.rs    # Portable scalar (Shoup multiplication, Harvey butterfly)
โ”‚   โ””โ”€โ”€ prime.rs     # NTT-friendly prime generation, primitive root finding
โ”œโ”€โ”€ ntt64/           # 64-bit NTT pipeline (Barrett + Montgomery)
โ”‚   โ”œโ”€โ”€ arith.rs     # 64-bit modular arithmetic
โ”‚   โ”œโ”€โ”€ context.rs   # Ntt64Context
โ”‚   โ””โ”€โ”€ prime.rs     # 64-bit prime utilities
โ”œโ”€โ”€ pq.rs            # Post-quantum presets (ML-DSA)
โ”œโ”€โ”€ poly.rs          # Poly64 โ€” polynomial over Z_q[X]/(X^N+1)
โ”œโ”€โ”€ rns.rs           # RNS/CRT multi-prime decomposition
โ”œโ”€โ”€ ffi.rs           # Diplomat FFI bridge
โ””โ”€โ”€ lib.rs
```

### Design Rationale

<details>
<summary><strong>Why 28-bit primes for <code>ntt32</code>?</strong></summary>

- **ARM NEON native**: 4ร—`u32` lanes. `u32 ร— u32` products fit in `u64`, no widening to `u128`.
- **Lazy reduction**: With `q < 2ยฒโธ`, intermediates `3q < 2ยณโฐ` fit in `u32`, enabling deferred Barrett reduction across multiple butterfly stages.
- **PQ aligned**: All NIST lattice standards use primes โ‰ค 23 bits โ€” well within 28 bits.

</details>

<details>
<summary><strong>Why two separate pipelines?</strong></summary>

- FHE schemes (CKKS, BGV) use 60โ€“62 bit primes โ€” these don't fit in `u32`.
- `ntt64` provides Barrett and Montgomery arithmetic for large primes.
- RNS combines multiple `ntt64` contexts for multi-precision FHE computation.

</details>

## Security

| Property | Guarantee |
|:---------|:----------|
| **Constant-time** | All arithmetic uses branchless SIMD masks (`vcgeq` + `vandq`), no data-dependent branches. |
| **Input validation** | `try_new()` rejects non-prime `q`, non-power-of-two `N`, and non-NTT-friendly primes. |
| **Memory safety** | All NEON accesses are bounds-checked via loop guards. `unsafe` limited to NEON intrinsics. |
| **Thread safety** | `Ntt32Context` is `Send + Sync`. Verified with 8 threads ร— 100 iterations. |

See [SECURITY.md](SECURITY.md) for the vulnerability disclosure policy.

## Testing

```bash
# Unit + integration + doc tests
cargo test --release

# Benchmarks
cargo bench --bench ntt32_bench      # NTT32 full scaling suite
cargo bench --bench ntt64_bench      # NTT64 pipeline
cargo bench --bench pq_bench         # Post-quantum presets

# Security & exhaustive validation
cargo run --release --example exhaustive_test          # 2618 test cases
cargo run --release --example verify_no_false_positive # anti-trivial-pass
cargo run --release --example security_exploits        # exploit suite
```

## License

This project is **dual-licensed**:

### Open Source โ€” AGPL-3.0-or-later

Free for open-source projects. See [LICENSE](LICENSE).

If you use VaeaNTT in a network service or distribute it, you must release your
complete source code under the AGPL. This applies to modified and unmodified usage.

### Commercial License

For closed-source, proprietary, or embedded use, a commercial license is available
that removes all AGPL obligations.

**Contact**: [alexis@vaea.tech](mailto:alexis@vaea.tech)