Please check the build logs for more information.
See Builds for ideas on how to fix a failed build, or Metadata for how to configure docs.rs builds.
If you believe this is docs.rs' fault, open an issue.
cubemoma
cubemoma is a Rust crate providing an efficient CPU-based implementation of Multi-word Modular Arithmetic (MoMA) for cryptographic kernels, inspired by the paper "Code Generation for Cryptographic Kernels using Multi-word Modular Arithmetic on GPU" by Naifeng Zhang and Franz Franchetti (arXiv:2501.07535v1, published January 13, 2025). This crate focuses on the CPU version, implementing finite field arithmetic over large primes (e.g., 448-bit) without relying on a code generator. It supports operations essential for Fully Homomorphic Encryption (FHE) and Zero-Knowledge Proofs (ZKPs), such as modular addition, multiplication, inversion, square roots, and Number Theoretic Transforms (NTT).
Description
This project adapts the MoMA formalization from the paper, which decomposes large integer arithmetic into machine-word operations (u64 limbs). Unlike the original GPU-focused work, cubemoma provides a static, const-generic Rust implementation for CPU environments. Key contributions from the paper include:
- MoMA Rewrite System: Breaks down large bit-width integers (e.g., 256-768 bits for ZKPs, up to 1000+ bits for FHE) into recursive word-level operations compatible with compilers.
- Cryptographic Kernels: Supports BLAS-like operations and NTT, with performance claims of orders-of-magnitude improvements over multi-precision libraries like GMP, and near-ASIC efficiency on GPUs (e.g., 14x faster than NVIDIA H100 for 256-bit NTT).
cubemoma translates these ideas to CPU, using Barrett reduction for modular ops and optimizations like pre-computed twiddles for NTT. It does not include GPU/CubeCL support yet.
For the full paper details, see the arXiv link.
Features
- Finite Field Arithmetic (
BigField<N>): Modular addition, subtraction, multiplication (schoolbook + Barrett), inversion (Fermat), and square root (Tonelli-Shanks with Legendre symbol check). - Quadratic Extension (
FpComplex<N>): Operations like mul_mod, add_mod, sub_mod, square_mod, norm_mod, inv_mod. - NTT Optimization: In-place Cooley-Tukey NTT with pre-computed twiddles for efficiency.
- Endianness Support: from_be/le, to_be/le for byte conversions.
- Const Generics: Parameterized by limb count
N(e.g., N=7 for 448-bit fields). - Exceptions: inv_mod and sqrt_mod return
Optionfor non-invertible/non-residue cases. - WASM Support: Basic bindings for browser-based benchmarks (CPU-only).
Requirements
- Rust edition: 2024
- Rust toolchain: nightly-2025-10-01 or later
- Dependencies: rand (for random field elements), wasm-bindgen (for WASM builds)
Installation
Add to your Cargo.toml:
[]
= "0.0.1" # Replace with actual version
= "0.9"
[]
= "0.2"
Build with cargo build --release. For WASM: wasm-pack build --target web.
Usage
Basic Field Operations
use ;
const N: usize = 7; // 448-bit
let p_limbs: = ;
let modulus = new;
let mu = precompute_mu;
let a = random;
let b = random;
let sum = a.add_mod;
let product = a.mul_mod;
if let Some = a.inv_mod
if let Some = a.sqrt_mod
NTT Example
let omega = from_limb; // Primitive root
let precompute = new;
let mut input: = .map.collect;
ntt;
Benchmarks
Run cargo bench for CPU benchmarks. Example results (AMD Ryzen 7 5800H, Rust nightly-2025-10-01):
- fp_add_mod: ~8.13 ns
- fp_mul_mod: ~251.38 ns
- fp_square_mod: ~252.90 ns
- fp_repeated_mul_mod (1000): ~253.15 µs
- fp_ntt (n=256): ~277.69 µs
- fp_inv_mod (Fermat): ~223.33 µs
WASM results (browser, post-optimization): mul_mod ~2 µs (from repeated), NTT ~2 ms – 8-10x faster than initial.
Contributing
Contributions welcome! Fork the repo, create a branch, and submit a PR. Focus on CPU optimizations (e.g., assembly intrinsics). Issues for bugs or features.
License
2-Clause BSD License. See LICENSE for details.
Acknowledgments
Based on the paper by Naifeng Zhang and Franz Franchetti. Special thanks to xAI for assistance in development.