hekate-math
Copyright (c) Andrei Kochergin and Oumuamua Labs.
Hardware-accelerated binary tower fields for zero-knowledge proofs.
hekate-math provides a high-performance, constant-time implementation of binary tower fields (𝔽(2k))
optimized for GKR-based provers, Sumcheck, and Binius protocols. The library implements a rigorous algebraic tower
construction up to 𝔽(2128), leveraging basis isomorphism to utilize native CPU hardware instructions:
PMULL (ARMv8 NEON) and PCLMULQDQ (x86_64 AVX2).
Designed for low-level cryptographic engineering, the crate is no-std compatible and defaults to constant-time
execution paths to mitigate side-channel attacks. It enforces strict type safety between canonical (tower) and
polynomial (flat/hardware) representations.
This is the mathematical core of the Hekate ZK Engine.
Performance Metrics
[!NOTE] Current benchmarks are reported with the
table-mathfeature enabled to reflect peak performance for public-data scenarios. For private-key operations, use the default constant-time backend.
Benchmarks executed on Apple M3 Max (aarch64). The library achieves near-native memory bandwidth saturation and single-cycle throughput for hardware-accelerated operations.
Micro-Benchmarks (Block128)
| Operation | Basis | Latency | Implementation |
|---|---|---|---|
| Multiplication | Polynomial (Flat) | 1.08 ns | PMULL (Pipelined) |
| Multiplication | Tower (Canonical) | 98.3 ns | Recursive Karatsuba |
| Addition | Any | 1.14 ns | Vectorized XOR |
| Inversion (Single) | Tower | 246.6 ns | Itoh-Tsujii / Fermat Little Theorem |
| Inversion (Batch) | Tower | 15.7 ns | Montgomery's Trick (SIMD) |
| Basis Conv (Default) | Tower ↔ Flat | 90.0 ns | Bit-Slicing (Constant-Time) |
| Basis Conv (Fast) | Tower ↔ Flat | 3.80 ns | Look-Up Table (Variable-Time) |
Impact: Flat basis multiplication is approximately 100x faster than the canonical recursive implementation.
Polynomial Arithmetic (Poly ALU)
Efficiency of polynomial operations in 𝔽(2^128).
| Operation | Scenario / Size | Time | Throughput |
|---|---|---|---|
| Dense Eval (Tower) | 2²⁰ coeffs | 91.93 ms | 174 MiB/s |
| Dense Eval (Hardware) | 2²⁰ coeffs | 8.34 ms | 1.87 GiB/s |
| Batch Eval (SIMD) | 256 × 16384 | 5.43 ms | 772 Melem/s |
| FFT Layer (RAM) | 2²⁰ elements | 909 µs | 1.15 Gelem/s |
| FFT Layer (L1) | 256 elements | 241 ns | 1.06 Gelem/s |
| Interpolate MSM | 65536 points | 77.12 µs | 850 Melem/s |
| MLE Evaluation | 20 variables | 1.27 ms | 822 Melem/s |
Sparse Matrix-Vector Multiplication (SpMV)
Benchmarks for Block128 SpMV with fixed degree 16 (typical for Brakedown/Binius).
| Matrix Size | Time (M3 Max) | Throughput | Memory Bandwidth (est.) |
|---|---|---|---|
| 64K Rows | ~284 µs | 3.69 Gelem/s | ~59 GB/s |
| 256K Rows | ~912 µs | 4.60 Gelem/s | ~73 GB/s |
| 1M Rows | ~5.16 ms | 3.25 Gelem/s | ~52 GB/s |
Installation
[]
= "0.3.0"
Examples
Basics: Field Arithmetic
- Addition is equivalent to XOR (
^). - Subtraction is identical to Addition (since -x = x).
- 1 + 1 = 0. This is the defining property of Characteristic 2 fields.
use ;
The Isomorphic Workflow
Most ZK protocols require transitioning between the Canonical Basis (for recursive folding/sumcheck) and the Polynomial Basis (for heavy arithmetic).
use ;
SIMD Vectorization
For throughput-critical paths, hekate-math provides explicit SIMD packing via the PackableField trait.
use ;
Sparse Matrix-Vector Multiplication (SpMV)
A core primitive for Brakedown, Binius, and linear-code based commitments. The engine efficiently
promotes u8 matrix weights to Block128 on the fly using typed flat promotion (FlatPromote).
use ByteSparseMatrix;
use ;
Roadmap
The immediate engineering focus is establishing absolute hardware supremacy across both ARM and x86 backends.
-
x86_64 Hardware Acceleration (Beta → Prod)
- Replace software fallbacks with hand-tuned assembly/intrinsics for AVX2 and PCLMULQDQ.
- Goal: Path to x86_64 Supremacy.
-
Formal Verification & Execution Path Auditing
- Mathematical modeling of execution boundaries and DoS-resistant state transitions.
- Goal: Enforce strict
Resultpropagation across all public interfaces for enterprise-grade fault tolerance.
Theoretical Foundation
hekate-math implements a binary tower field architecture. The field 𝔽(2^128)
is constructed via recursive quadratic extensions using the reduction polynomial v² + v + βᵢ.
The Tower Hierarchy
The construction follows a strict recursive data layout. Higher-order blocks are composed of two lower-order blocks (Low, High).
Block128 (GF(2^128))
/ \
Block64 Block64
/ \ / \
Block32 Block32 ... ...
/ \
Block16 Block16
/ \
Block8 Block8 (Base Field GF(2^8))
|
[Bit; 8] (Atomic Unit GF(2))
Algebraic Construction
The extension defines 𝔽(2(2(i+1))) ≅ 𝔽(2(2i))[v] / (v² + v + βᵢ),
where βᵢ is the extension constant (EXTENSION_TAU) for that level.
| Height | Field | Implementation | Extension Constant (β) | Arithmetic |
|---|---|---|---|---|
| h=0 | 𝔽₂ | Bit |
N/A | Boolean (XOR/AND) |
| h=3 | 𝔽(2^8) | Block8 |
Base Field (AES Poly) | Recursive / Karatsuba |
| h=4 | 𝔽(2^16) | Block16 |
0x20 ∈ Block8 | Recursive / Karatsuba |
| h=5 | 𝔽(2^32) | Block32 |
0x2000 ∈ Block16 | Recursive / Karatsuba |
| h=6 | 𝔽(2^64) | Block64 |
0x20000000 ∈ Block32 | Recursive / Karatsuba |
| h=7 | 𝔽(2^128) | Block128 |
0x2000000000000000 ∈ Block64 | Recursive / Karatsuba |
Note: The tower is rooted at F(2^8) (AES Field) for hardware compatibility. Lower fields (Bit) are subfields embedded via isomorphism, making this a Hybrid Tower construction.
The Isomorphic Basis Architecture
To bridge the gap between algebraic recursion and CPU pipeline efficiency, hekate-math implements a hybrid basis
system. Canonical values stay in F, while hardware/polynomial values are represented explicitly as Flat<F>.
Canonical Basis (Tower)
The default representation optimized for recursive algebraic operations (e.g., Sumcheck, GKR Layer folding). Elements are structured as linear polynomials A(v) = a₁v + a₀ over the subfield.
- Structure: Recursive coefficients (a_hi, a_lo).
- Operation: Karatsuba Multiplication (3 sub-multiplications).
- Memory: Standard layout (Little-Endian).
Polynomial Basis (Flat)
An isomorphic representation mapping the tower structure to a dense polynomial basis (1, x, x²...) optimized for specific CPU instruction sets (AES-NI, PMULL, PCLMULQDQ).
- Structure: Linear bit-packed integers (
u8,u64,u128). - Operation: Single-cycle Carry-Less Multiplication (
CLMUL) with hardware-accelerated reduction. - Throughput: 1.17ns per multiplication (Block128 on modern architectures).
Isomorphism & Interop
The library strictly enforces basis separation through the type system to prevent mixing representations.
The Isomorphism φ is defined as: φ: 𝔽(Tower) ↔ 𝔽(Hardware)
Change-of-basis matrices are pre-computed constant-time bit-sliced operations by default, with an optional
table-math feature for cached lookups.
Implementation Details & Safety
hekate-math prioritizes correctness and side-channel resistance over "naive" speed, enforcing strict memory layouts
and algorithmic choices.
Memory Layout & Type Safety
Field elements are zero-cost wrappers around native integer types, ensuring ABI compatibility and predictable register usage.
- Scalar Storage:
#[repr(transparent)]structs wrappingu8,u16,u32,u64,u128. - Vector Storage:
#[repr(C, align(N))]SIMD-packed structs (e.g.,PackedBlock128is 32-byte aligned for AVX2/NEON compliance). - Safe Rust:
unsafeis restricted to SIMD intrinsics and bounds-checked lookups. Isomorphisms are checked via theHardwareFieldtrait system.
Security Model
The library operates under a configurable security model designed for cryptographic contexts where secret-dependent execution time is catastrophic.
| Feature Flag | Behavior | Use Case | Security |
|---|---|---|---|
default-features |
Bitsliced Constant-Time | Private Key / Prover | High (Side-Channel Resistant) |
table-math |
Cached Lookup Tables | Public Verifier / Rollup | Low (Variable Access Time) |
table-math |
Cached Lifting Tables | Public Data Ingestion | Low (Variable Access Time) |
- Basis Conversion: By default, φ and φ⁻¹ are computed using constant-time bit-sliced matrix multiplication, independent of the input value.
- Hardware Arithmetic:
Block128multiplication utilizes carry-less multiplication instructions (PMULLon ARMv8,PCLMULQDQon x86_64), which are constant-latency on modern microarchitectures.
Hardware Support
| Architecture | Feature Requirement | Instructions Used | Status |
|---|---|---|---|
| aarch64 | neon, pmull |
vmull_p64, veorq_u8 |
Production |
| x86_64 | N/A | xor, sw_mul |
Development |
| WASM | simd128 |
v128.xor, sw_mul |
Software Fallback |
Note: Native AVX2/PCLMULQDQ implementation for x86_64 is on the roadmap.
Reproduce benchmarks
[!IMPORTANT] Hardware arithmetic performance (e.g., mul_hardware, add_hardware) remains identical regardless of the
table-mathfeature. This feature specifically optimizes the Isomorphism (basis conversion) and Lifting operations. The actual field arithmetic in the flat basis always utilizes the fastest available hardware instructions (PMULL / PCLMULQDQ).
Secure (Default)
Uses constant-time bitsliced matrix multiplication for basis conversion and lifting:
Fast (table-math)
Uses precomputed lookup tables for basis conversion:
Security & Audits
[!WARNING] This implementation is currently UNAUDITED.
It is provided "AS IS" with ABSOLUTELY NO WARRANTY under the terms of the Apache 2.0 License. The authors assume zero liability for any damages arising from its use in production environments.
License
Licensed under Apache 2.0. See the LICENSE and NOTICE files for details.