hekate-math
Copyright (c) Andrei Kochergin and Oumuamua Labs.
Hardware-accelerated binary tower fields for zero-knowledge proofs.
hekate-math provides a high-performance, constant-time implementation of binary tower fields (𝔽(2k))
optimized for GKR-based provers, Sumcheck, and Binius protocols. The library implements a rigorous algebraic tower
construction up to 𝔽(2256), leveraging basis isomorphism to utilize native CPU hardware instructions:
PMULL (ARMv8 NEON) and PCLMULQDQ (x86_64 AVX2).
Designed for low-level cryptographic engineering, the crate is no-std compatible and defaults to constant-time
execution paths to mitigate side-channel attacks. It enforces strict type safety between canonical (tower) and
polynomial (flat/hardware) representations.
This is the mathematical core of the Hekate ZK Engine.
Performance Metrics
[!NOTE] Current benchmarks are reported with the
table-mathfeature enabled to reflect peak performance for public-data scenarios. For private-key operations, use the default constant-time backend.
Benchmarks executed on Apple M3 Max (aarch64). The library achieves near-native memory bandwidth saturation and single-cycle throughput for hardware-accelerated operations.
Micro-Benchmarks (Block128)
| Operation | Basis | Latency | Implementation |
|---|---|---|---|
| Multiplication | Polynomial (Flat) | 1.08 ns | PMULL (Pipelined) |
| Multiplication | Tower (Canonical) | 98.3 ns | Recursive Karatsuba |
| Addition | Any | 1.14 ns | Vectorized XOR |
| Inversion (Single) | Tower | 246.6 ns | Itoh-Tsujii / Fermat Little Theorem |
| Inversion (Batch) | Tower | 15.7 ns | Montgomery's Trick (SIMD) |
| Basis Conv (Default) | Tower ↔ Flat | 90.0 ns | Bit-Slicing (Constant-Time) |
| Basis Conv (Fast) | Tower ↔ Flat | 3.80 ns | Look-Up Table (Variable-Time) |
Impact: Flat basis multiplication is approximately 100x faster than the canonical recursive implementation.
Polynomial Arithmetic (Poly ALU)
Efficiency of polynomial operations in 𝔽(2^128).
| Operation | Scenario / Size | Time | Throughput |
|---|---|---|---|
| Dense Eval (Tower) | 2²⁰ coeffs | 91.93 ms | 174 MiB/s |
| Dense Eval (Hardware) | 2²⁰ coeffs | 8.34 ms | 1.87 GiB/s |
| Batch Eval (SIMD) | 256 × 16384 | 5.43 ms | 772 Melem/s |
| Additive FFT (scalar) | 2¹⁶ · Block16 | 1.31 ms | 50 Melem/s |
| Additive FFT (packed) | 2¹⁶ · ×8 lanes | 3.16 ms | 166 Melem/s |
| Interpolate MSM | 65536 points | 77.12 µs | 850 Melem/s |
| MLE Evaluation | 20 variables | 1.27 ms | 822 Melem/s |
Sparse Matrix-Vector Multiplication (SpMV)
Benchmarks for Block128 SpMV with fixed degree 16 (typical for Brakedown/Binius).
| Matrix Size | Time (M3 Max) | Throughput | Memory Bandwidth (est.) |
|---|---|---|---|
| 64K Rows | ~171 µs | 6.14 Gelem/s | ~98 GB/s |
| 256K Rows | ~628 µs | 6.68 Gelem/s | ~107 GB/s |
| 1M Rows | ~5.17 ms | 3.25 Gelem/s | ~52 GB/s |
Installation
[]
= "0.8.0"
Examples
Basics: Field Arithmetic
- Addition is equivalent to XOR (
^). - Subtraction is identical to Addition (since -x = x).
- 1 + 1 = 0. This is the defining property of Characteristic 2 fields.
use ;
The Isomorphic Workflow
Most ZK protocols require transitioning between the Canonical Basis (for recursive folding/sumcheck) and the Polynomial Basis (for heavy arithmetic).
use ;
SIMD Vectorization
For throughput-critical paths, hekate-math provides explicit SIMD packing via the PackableField trait.
use ;
Sparse Matrix-Vector Multiplication (SpMV)
A core primitive for Brakedown, Binius, and linear-code based commitments. ByteSparseMatrix holds
binary weights (0/1) packed as u8 and applies them to a hardware-basis vector by branch-gated
XOR, no per-weight field multiply. The vector comes from any VectorSource, so dense slices and
zero-allocation algorithmic generators share a single code path.
Random expander sampling, drawing the matrix from a seed, lives in the consumer, not here: which oracle samples the code is a PCS soundness parameter, not field arithmetic.
use ByteSparseMatrix;
use ;
Roadmap
The immediate engineering focus is establishing absolute hardware supremacy across both ARM and x86 backends.
-
x86_64 Hardware Acceleration (Beta → Prod)
- Replace software fallbacks with hand-tuned assembly/intrinsics for AVX2 and PCLMULQDQ.
- Goal: Path to x86_64 Supremacy.
-
Formal Verification & Execution Path Auditing
- Mathematical modeling of execution boundaries and DoS-resistant state transitions.
- Goal: Enforce strict
Resultpropagation across all public interfaces for enterprise-grade fault tolerance.
Theoretical Foundation
hekate-math implements a binary tower field architecture. The field 𝔽(2^128)
is constructed via recursive quadratic extensions using the reduction polynomial v² + v + βᵢ.
The Tower Hierarchy
The construction follows a strict recursive data layout. Higher-order blocks are composed of two lower-order blocks (Low, High).
Block256 (GF(2^256))
/ \
Block128 Block128 (GF(2^128))
/ \ / \
Block64 Block64 ... ...
/ \
Block32 Block32
/ \
Block16 Block16
/ \
Block8 Block8 (Base Field GF(2^8))
|
[Bit; 8] (Atomic Unit GF(2))
Algebraic Construction
The extension defines 𝔽(2(2(i+1))) ≅ 𝔽(2(2i))[v] / (v² + v + βᵢ),
where βᵢ is the extension constant (EXTENSION_TAU) for that level.
| Height | Field | Implementation | Extension Constant (β) | Arithmetic |
|---|---|---|---|---|
| h=0 | 𝔽₂ | Bit |
N/A | Boolean (XOR/AND) |
| h=3 | 𝔽(2^8) | Block8 |
Base Field (AES Poly) | Recursive / Karatsuba |
| h=4 | 𝔽(2^16) | Block16 |
0x20 ∈ Block8 | Recursive / Karatsuba |
| h=5 | 𝔽(2^32) | Block32 |
0x2000 ∈ Block16 | Recursive / Karatsuba |
| h=6 | 𝔽(2^64) | Block64 |
0x20000000 ∈ Block32 | Recursive / Karatsuba |
| h=7 | 𝔽(2^128) | Block128 |
0x2000000000000000 ∈ Block64 | Recursive / Karatsuba |
| h=8 | 𝔽(2^256) | Block256 |
0x20000000000000000000000000000000 ∈ Block128 | Recursive / Karatsuba |
Note: The tower is rooted at F(2^8) (AES Field) for hardware compatibility. Lower fields (Bit) are subfields embedded via isomorphism, making this a Hybrid Tower construction.
The Isomorphic Basis Architecture
To bridge the gap between algebraic recursion and CPU pipeline efficiency, hekate-math implements a hybrid basis
system. Canonical values stay in F, while hardware/polynomial values are represented explicitly as Flat<F>.
Canonical Basis (Tower)
The default representation optimized for recursive algebraic operations (e.g., Sumcheck, GKR Layer folding). Elements are structured as linear polynomials A(v) = a₁v + a₀ over the subfield.
- Structure: Recursive coefficients (a_hi, a_lo).
- Operation: Karatsuba Multiplication (3 sub-multiplications).
- Memory: Standard layout (Little-Endian).
Polynomial Basis (Flat)
An isomorphic representation mapping the tower structure to a dense polynomial basis (1, x, x²...) optimized for specific CPU instruction sets (AES-NI, PMULL, PCLMULQDQ).
- Structure: Linear bit-packed integers (
u8,u64,u128). - Operation: Single-cycle Carry-Less Multiplication (
CLMUL) with hardware-accelerated reduction. - Throughput: 1.17ns per multiplication (Block128 on modern architectures).
Isomorphism & Interop
The library strictly enforces basis separation through the type system to prevent mixing representations.
The Isomorphism φ is defined as: φ: 𝔽(Tower) ↔ 𝔽(Hardware)
Change-of-basis matrices are pre-computed constant-time bit-sliced operations by default, with an optional
table-math feature for cached lookups.
Implementation Details & Safety
hekate-math prioritizes correctness and side-channel resistance over "naive" speed, enforcing strict memory layouts
and algorithmic choices.
Memory Layout & Type Safety
Field elements are zero-cost wrappers around native integer types, ensuring ABI compatibility and predictable register usage.
- Scalar Storage:
#[repr(transparent)]structs wrappingu8,u16,u32,u64,u128. - Vector Storage:
#[repr(C, align(N))]SIMD-packed structs (e.g.,PackedBlock128is 32-byte aligned for AVX2/NEON compliance). - Safe Rust:
unsafeis restricted to SIMD intrinsics and bounds-checked lookups. Isomorphisms are checked via theHardwareFieldtrait system.
Security Model
The library operates under a configurable security model designed for cryptographic contexts where secret-dependent execution time is catastrophic.
| Feature Flag | Behavior | Use Case | Security |
|---|---|---|---|
default-features |
Bitsliced Constant-Time | Private Key / Prover | High (Side-Channel Resistant) |
table-math |
Cached Lookup Tables | Public Verifier / Rollup | Low (Variable Access Time) |
table-math |
Cached Lifting Tables | Public Data Ingestion | Low (Variable Access Time) |
- Basis Conversion: By default, φ and φ⁻¹ are computed using constant-time bit-sliced matrix multiplication, independent of the input value.
- Hardware Arithmetic:
Block128multiplication utilizes carry-less multiplication instructions (PMULLon ARMv8,PCLMULQDQon x86_64), which are constant-latency on modern microarchitectures.
Hardware Support
| Architecture | Feature Requirement | Instructions Used | Status |
|---|---|---|---|
| aarch64 | neon, pmull |
vmull_p64, veorq_u8 |
Production |
| x86_64 | N/A | xor, sw_mul |
Development |
| WASM | simd128 |
v128.xor, sw_mul |
Software Fallback |
Note: Native AVX2/PCLMULQDQ implementation for x86_64 is on the roadmap.
Reproduce benchmarks
[!IMPORTANT] Hardware arithmetic performance (e.g., mul_hardware, add_hardware) remains identical regardless of the
table-mathfeature. This feature specifically optimizes the Isomorphism (basis conversion) and Lifting operations. The actual field arithmetic in the flat basis always utilizes the fastest available hardware instructions (PMULL / PCLMULQDQ).
Secure (Default)
Uses constant-time bitsliced matrix multiplication for basis conversion and lifting:
Fast (table-math)
Uses precomputed lookup tables for basis conversion:
Security & Audits
[!WARNING] This implementation is currently UNAUDITED.
It is provided "AS IS" with ABSOLUTELY NO WARRANTY under the terms of the Apache 2.0 License. The authors assume zero liability for any damages arising from its use in production environments.
License
Licensed under Apache 2.0. See the LICENSE and NOTICE files for details.