hekate-math

Copyright (c) Andrei Kochergin and Oumuamua Labs.

Hardware-accelerated binary tower fields for zero-knowledge proofs.

hekate-math provides a high-performance, constant-time implementation of binary tower fields (𝔽(2^{k))
optimized for GKR-based provers, Sumcheck, and Binius protocols. The library implements a rigorous algebraic tower
construction up to 𝔽(2}256), leveraging basis isomorphism to utilize native CPU hardware instructions: PMULL (ARMv8 NEON) and PCLMULQDQ (x86_64 AVX2).

Designed for low-level cryptographic engineering, the crate is no-std compatible and defaults to constant-time execution paths to mitigate side-channel attacks. It enforces strict type safety between canonical (tower) and polynomial (flat/hardware) representations.

This is the mathematical core of the Hekate ZK Engine.

Performance Metrics

[!NOTE] Current benchmarks are reported with the table-math feature enabled to reflect peak performance for public-data scenarios. For private-key operations, use the default constant-time backend.

Benchmarks executed on Apple M3 Max (aarch64). The library achieves near-native memory bandwidth saturation and single-cycle throughput for hardware-accelerated operations.

Micro-Benchmarks (Block128)

Operation	Basis	Latency	Implementation
Multiplication	Polynomial (Flat)	1.08 ns	`PMULL` (Pipelined)
Multiplication	Tower (Canonical)	98.3 ns	Recursive Karatsuba
Addition	Any	1.14 ns	Vectorized XOR
Inversion (Single)	Tower	246.6 ns	Itoh-Tsujii / Fermat Little Theorem
Inversion (Batch)	Tower	15.7 ns	Montgomery's Trick (SIMD)
Basis Conv (Default)	Tower ↔ Flat	90.0 ns	Bit-Slicing (Constant-Time)
Basis Conv (Fast)	Tower ↔ Flat	3.80 ns	Look-Up Table (Variable-Time)

Impact: Flat basis multiplication is approximately 100x faster than the canonical recursive implementation.

Polynomial Arithmetic (Poly ALU)

Efficiency of polynomial operations in 𝔽(2^128).

Operation	Scenario / Size	Time	Throughput
Dense Eval (Tower)	2²⁰ coeffs	91.93 ms	174 MiB/s
Dense Eval (Hardware)	2²⁰ coeffs	8.34 ms	1.87 GiB/s
Batch Eval (SIMD)	256 × 16384	5.43 ms	772 Melem/s
Additive FFT (scalar)	2¹⁶ · Block16	1.31 ms	50 Melem/s
Additive FFT (packed)	2¹⁶ · ×8 lanes	3.16 ms	166 Melem/s
Interpolate MSM	65536 points	77.12 µs	850 Melem/s
MLE Evaluation	20 variables	1.27 ms	822 Melem/s

Sparse Matrix-Vector Multiplication (SpMV)

Benchmarks for Block128 SpMV with fixed degree 16 (typical for Brakedown/Binius).

Matrix Size	Time (M3 Max)	Throughput	Memory Bandwidth (est.)
64K Rows	~171 µs	6.14 Gelem/s	~98 GB/s
256K Rows	~628 µs	6.68 Gelem/s	~107 GB/s
1M Rows	~5.17 ms	3.25 Gelem/s	~52 GB/s

Installation

[dependencies]
hekate-math = "0.8.0"

Examples

Basics: Field Arithmetic

Addition is equivalent to XOR (^).
Subtraction is identical to Addition (since -x = x).
1 + 1 = 0. This is the defining property of Characteristic 2 fields.

use hekate_math::{Block128, TowerField};

fn example_basics() {
    // Initialize elements (Block128 represents GF(2^128))
    let a = Block128::from(5u128); // 101
    let b = Block128::from(3u128); // 011

    // 1. Addition is XOR
    // 5 XOR 3 = 6 (110)
    let sum = a + b;
    assert_eq!(sum, Block128::from(6u128));

    // 2. Characteristic 2 Property
    // Adding an element to itself results in Zero.
    let zero = a + a;
    assert_eq!(zero, Block128::ZERO);
    assert_eq!(Block128::from(5u128) - a, Block128::ZERO); // Subtraction is also XOR

    // 3. Multiplication
    // This uses Galois Field arithmetic
    // (carrying over the irreducible polynomial).
    let product = a * b;

    // In normal integers 5*3=15, but in GF(2^128)
    // it behaves differently based on reduction.
    println!("Basics: 5 * 3 in GF(2^128) = {:?}", product);
}

The Isomorphic Workflow

Most ZK protocols require transitioning between the Canonical Basis (for recursive folding/sumcheck) and the Polynomial Basis (for heavy arithmetic).

use hekate_math::{Block128, HardwareField, TowerField};

fn example_isomorphism() {
    // 1. Canonical Basis (Tower)
    let a_tower = Block128::from_uniform_bytes(&[0xaa; 32]);
    let b_tower = Block128::from_uniform_bytes(&[0xbb; 32]);

    // 2. Basis Conversion -> Polynomial (Flat)
    let a_flat = a_tower.to_hardware();
    let b_flat = b_tower.to_hardware();

    // 3. Hardware-Accelerated Arithmetic
    let c_flat = a_flat * b_flat;
    let d_flat = a_flat + b_flat;

    // 4. Return to Canonical Basis
    let c_tower = c_flat.to_tower();
    let d_tower = d_flat.to_tower();

    // 5. Verify Homomorphism
    assert_eq!(
        c_tower,
        a_tower * b_tower,
        "Multiplication Homomorphism failed"
    );
    assert_eq!(d_tower, a_tower + b_tower, "Addition Homomorphism failed");
}

SIMD Vectorization

For throughput-critical paths, hekate-math provides explicit SIMD packing via the PackableField trait.

use hekate_math::{Block32, Flat, HardwareField, PackableField, TowerField};

fn process_simd(data: &[Flat<Block32>]) {
    // 1. Pack hardware-basis scalars into SIMD registers
    // PackedBlock32 holds 4 elements (128 bits total).
    // The data must already be in the Flat/Hardware
    // basis for hardware-accelerated operations
    // to be algebraically correct.
    let chunk_a = Flat::<Block32>::pack(&data[0..4]);
    let chunk_b = Flat::<Block32>::pack(&data[4..8]);

    // 2. Vectorized Arithmetic
    // Performs 4 parallel field
    // multiplications in the hardware basis.
    let result_packed = chunk_a * chunk_b;

    // 3. Unpack back to scalars
    let mut out_flat = [Block32::ZERO.to_hardware(); 4];
    Flat::<Block32>::unpack(result_packed, &mut out_flat);

    // 4. Verification
    for i in 0..4 {
        // Convert back to verify
        // against standard multiplication.
        let res_tower = out_flat[i].to_tower();

        // Manual tower multiplication for comparison
        let a_tower = data[i].to_tower();
        let b_tower = data[4 + i].to_tower();

        assert_eq!(res_tower, a_tower * b_tower, "SIMD multiplication mismatch");
    }
}

fn example_simd() {
    // Initialize data and immediately
    // transform to Hardware Basis.
    let data: Vec<Flat<Block32>> = (0..8)
        .map(|i| Block32::from(i as u32 + 1).to_hardware())
        .collect();

    process_simd(&data);
}

Sparse Matrix-Vector Multiplication (SpMV)

A core primitive for Brakedown, Binius, and linear-code based commitments. ByteSparseMatrix holds binary weights (0/1) packed as u8 and applies them to a hardware-basis vector by branch-gated XOR, no per-weight field multiply. The vector comes from any VectorSource, so dense slices and zero-allocation algorithmic generators share a single code path.

Random expander sampling, drawing the matrix from a seed, lives in the consumer, not here: which oracle samples the code is a PCS soundness parameter, not field arithmetic.

use hekate_math::matrix::ByteSparseMatrix;
use hekate_math::{Block128, Flat, HardwareField};

fn example_spmv() {
    // 2 rows, 3 cols, degree 2. Binary weights only.
    // Row 0 = col0 + col2,  Row 1 = col1 + col0.
    let weights = vec![1u8, 1, 1, 1];
    let col_indices = vec![0, 2, 1, 0];
    let matrix = ByteSparseMatrix::new(2, 3, 2, weights, col_indices);

    // Input vector in the hardware (flat) basis.
    let x0 = Block128::from(10u128).to_hardware();
    let x1 = Block128::from(100u128).to_hardware();
    let x2 = Block128::from(255u128).to_hardware();
    let input: Vec<Flat<Block128>> = vec![x0, x1, x2];

    let output = matrix.spmv(input.as_slice());

    // Addition in a binary field is XOR.
    assert_eq!(output[0], x0 + x2);
    assert_eq!(output[1], x1 + x0);
}

Roadmap

The immediate engineering focus is establishing absolute hardware supremacy across both ARM and x86 backends.

x86_64 Hardware Acceleration (Beta → Prod)
- Replace software fallbacks with hand-tuned assembly/intrinsics for AVX2 and PCLMULQDQ.
- Goal: Path to x86_64 Supremacy.
Formal Verification & Execution Path Auditing
- Mathematical modeling of execution boundaries and DoS-resistant state transitions.
- Goal: Enforce strict Result propagation across all public interfaces for enterprise-grade fault tolerance.

Theoretical Foundation

hekate-math implements a binary tower field architecture. The field 𝔽(2^128) is constructed via recursive quadratic extensions using the reduction polynomial v² + v + βᵢ.

The Tower Hierarchy

The construction follows a strict recursive data layout. Higher-order blocks are composed of two lower-order blocks (Low, High).

                    Block256 (GF(2^256))
                    /              \
              Block128              Block128 (GF(2^128))
                /    \              /     \
          Block64   Block64       ...     ...
           /    \
       Block32  Block32
        /    \
    Block16  Block16
     /    \
 Block8   Block8  (Base Field GF(2^8))
    |
  [Bit; 8]        (Atomic Unit GF(2))

Algebraic Construction

The extension defines 𝔽(2⁽²(i+1))) ≅ 𝔽(2⁽²i))[v] / (v² + v + βᵢ), where βᵢ is the extension constant (EXTENSION_TAU) for that level.

Height	Field	Implementation	Extension Constant (β)	Arithmetic
h=0	𝔽₂	`Bit`	N/A	Boolean (XOR/AND)
h=3	𝔽(2^8)	`Block8`	Base Field (AES Poly)	Recursive / Karatsuba
h=4	𝔽(2^16)	`Block16`	0x20 ∈ Block8	Recursive / Karatsuba
h=5	𝔽(2^32)	`Block32`	0x2000 ∈ Block16	Recursive / Karatsuba
h=6	𝔽(2^64)	`Block64`	0x20000000 ∈ Block32	Recursive / Karatsuba
h=7	𝔽(2^128)	`Block128`	0x2000000000000000 ∈ Block64	Recursive / Karatsuba
h=8	𝔽(2^256)	`Block256`	0x20000000000000000000000000000000 ∈ Block128	Recursive / Karatsuba

Note: The tower is rooted at F(2^8) (AES Field) for hardware compatibility. Lower fields (Bit) are subfields embedded via isomorphism, making this a Hybrid Tower construction.

The Isomorphic Basis Architecture

To bridge the gap between algebraic recursion and CPU pipeline efficiency, hekate-math implements a hybrid basis system. Canonical values stay in F, while hardware/polynomial values are represented explicitly as Flat<F>.

Canonical Basis (Tower)

The default representation optimized for recursive algebraic operations (e.g., Sumcheck, GKR Layer folding). Elements are structured as linear polynomials A(v) = a₁v + a₀ over the subfield.

Structure: Recursive coefficients (a_hi, a_lo).
Operation: Karatsuba Multiplication (3 sub-multiplications).
Memory: Standard layout (Little-Endian).

Polynomial Basis (Flat)

An isomorphic representation mapping the tower structure to a dense polynomial basis (1, x, x²...) optimized for specific CPU instruction sets (AES-NI, PMULL, PCLMULQDQ).

Structure: Linear bit-packed integers (u8, u64, u128).
Operation: Single-cycle Carry-Less Multiplication (CLMUL) with hardware-accelerated reduction.
Throughput: 1.17ns per multiplication (Block128 on modern architectures).

Isomorphism & Interop

The library strictly enforces basis separation through the type system to prevent mixing representations.

The Isomorphism φ is defined as: φ: 𝔽(Tower) ↔ 𝔽(Hardware)

pub trait HardwareField: TowerField + PackableField {
    /// Transform Canonical -> Flat
    fn to_hardware(self) -> Flat<Self>;

    /// Transform Flat -> Canonical
    fn from_hardware(value: Flat<Self>) -> Self;

    /// Sum two elements assuming they
    /// are already in hardware basis.
    fn add_hardware(lhs: Flat<Self>, rhs: Flat<Self>) -> Flat<Self>;

    /// Multiply two elements assuming
    /// they are already in hardware basis.
    fn mul_hardware(lhs: Flat<Self>, rhs: Flat<Self>) -> Flat<Self>;

    /// Extract a bit of the Tower representation
    /// directly from the Hardware basis.
    fn tower_bit_from_hardware(value: Flat<Self>, bit_idx: usize) -> u8;
}

Change-of-basis matrices are pre-computed constant-time bit-sliced operations by default, with an optional table-math feature for cached lookups.

Implementation Details & Safety

hekate-math prioritizes correctness and side-channel resistance over "naive" speed, enforcing strict memory layouts and algorithmic choices.

Memory Layout & Type Safety

Field elements are zero-cost wrappers around native integer types, ensuring ABI compatibility and predictable register usage.

Scalar Storage: #[repr(transparent)] structs wrapping u8, u16, u32, u64, u128.
Vector Storage: #[repr(C, align(N))] SIMD-packed structs (e.g., PackedBlock128 is 32-byte aligned for AVX2/NEON compliance).
Safe Rust: unsafe is restricted to SIMD intrinsics and bounds-checked lookups. Isomorphisms are checked via the HardwareField trait system.

Security Model

The library operates under a configurable security model designed for cryptographic contexts where secret-dependent execution time is catastrophic.

Feature Flag	Behavior	Use Case	Security
`default-features`	Bitsliced Constant-Time	Private Key / Prover	High (Side-Channel Resistant)
`table-math`	Cached Lookup Tables	Public Verifier / Rollup	Low (Variable Access Time)
`table-math`	Cached Lifting Tables	Public Data Ingestion	Low (Variable Access Time)

Basis Conversion: By default, φ and φ⁻¹ are computed using constant-time bit-sliced matrix multiplication, independent of the input value.
Hardware Arithmetic: Block128 multiplication utilizes carry-less multiplication instructions (PMULL on ARMv8, PCLMULQDQ on x86_64), which are constant-latency on modern microarchitectures.

Hardware Support

Architecture	Feature Requirement	Instructions Used	Status
aarch64	`neon`, `pmull`	`vmull_p64`, `veorq_u8`	Production
x86_64	N/A	`xor`, `sw_mul`	Development
WASM	`simd128`	`v128.xor`, `sw_mul`	Software Fallback

Note: Native AVX2/PCLMULQDQ implementation for x86_64 is on the roadmap.

Reproduce benchmarks

[!IMPORTANT] Hardware arithmetic performance (e.g., mul_hardware, add_hardware) remains identical regardless of the table-math feature. This feature specifically optimizes the Isomorphism (basis conversion) and Lifting operations. The actual field arithmetic in the flat basis always utilizes the fastest available hardware instructions (PMULL / PCLMULQDQ).

Secure (Default)

Uses constant-time bitsliced matrix multiplication for basis conversion and lifting:

cargo bench

Fast (table-math)

Uses precomputed lookup tables for basis conversion:

cargo bench --features table-math

Security & Audits

[!WARNING] This implementation is currently UNAUDITED.

It is provided "AS IS" with ABSOLUTELY NO WARRANTY under the terms of the Apache 2.0 License. The authors assume zero liability for any damages arising from its use in production environments.

License

Licensed under Apache 2.0. See the LICENSE and NOTICE files for details.

hekate-math 0.8.0

hekate-math

Performance Metrics

Micro-Benchmarks (Block128)

Polynomial Arithmetic (Poly ALU)

Sparse Matrix-Vector Multiplication (SpMV)

Installation

Examples

Basics: Field Arithmetic

The Isomorphic Workflow

SIMD Vectorization

Sparse Matrix-Vector Multiplication (SpMV)

Roadmap

Theoretical Foundation

The Tower Hierarchy

Algebraic Construction

The Isomorphic Basis Architecture

Canonical Basis (Tower)

Polynomial Basis (Flat)

Isomorphism & Interop

Implementation Details & Safety

Memory Layout & Type Safety

Security Model

Hardware Support

Reproduce benchmarks

Secure (Default)

Fast (table-math)

Security & Audits

License