Ligerito
polynomial commitment scheme over binary extension fields.
⚠️ IMPORTANT: For optimal performance (3x speedup), install with native CPU optimizations: `RUSTFLAGS="-C target-cpu=native"`
Without this flag, the prover will be significantly slower. See installation for details.
what it's good for
- committing to large polynomials with small proofs (~150 KB for 2^20 polynomial)
- fast proving on modern cpus with simd (50ms for 1M elements on AVX-512, SMT off)
- tiered simd: AVX-512 → AVX2 → SSE → scalar fallback
- verifier-only builds for constrained environments (polkavm, wasm, embedded)
- transparent setup (no trusted setup required)
- enabling verifiable light client p2p networks
what it's not good for
- general-purpose zkp (no arbitrary circuits, only polynomial commitments)
- proving without simd (slow without hardware acceleration)
- tiny polynomials (proof overhead significant below 2^12)
- scenarios requiring smallest possible proofs (starkware/plonky2 may be smaller)
library usage
add to Cargo.toml:
```toml
[dependencies]
ligerito = "0.3.1"
```
⚠️ for development: clone the workspace to get automatic native optimizations.
example:

```rust
// note: import paths and exact signatures below are assumptions made for
// illustration; consult the crate documentation for the precise API.
use ligerito::{hardcoded_config_20, hardcoded_config_20_verifier, prove, verify, BinaryElem32};
use std::marker::PhantomData;

// create prover config
let config = hardcoded_config_20(PhantomData::<BinaryElem32>);

// polynomial to commit (2^20 elements)
let poly: Vec<BinaryElem32> = vec![BinaryElem32::from(1u32); 1 << 20];

// generate proof
let proof = prove(&config, &poly).unwrap();

// verify proof
let verifier_config = hardcoded_config_20_verifier(PhantomData::<BinaryElem32>);
let valid = verify(&verifier_config, &proof).unwrap();
assert!(valid);
```
transcript backends

```rust
// note: import paths and call signatures are assumptions; the function names
// prove_sha256 / verify_sha256 and prove / verify are as used in this README.

// sha256 (default, no extra deps, works in no_std)
use ligerito::{prove_sha256, verify_sha256};
let proof = prove_sha256(&config, &poly).unwrap();
let valid = verify_sha256(&verifier_config, &proof).unwrap();

// merlin (requires std + transcript-merlin feature)
use ligerito::{prove, verify};
let proof = prove(&config, &poly).unwrap();
let valid = verify(&verifier_config, &proof).unwrap();
```
supported sizes
configs available: `hardcoded_config_{12,16,20,24,28,30}` for the prover and `hardcoded_config_{12,16,20,24,28,30}_verifier` for the verifier.
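pick the config that matches the polynomial size. a minimal sketch for a 2^24-element polynomial, under the same assumed signatures as the example above:

```rust
// hypothetical usage; the PhantomData-based signature is an assumption
let config = hardcoded_config_24(PhantomData::<BinaryElem32>);
let verifier_config = hardcoded_config_24_verifier(PhantomData::<BinaryElem32>);
```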
build configurations
full-featured (default)
includes: prover, verifier, parallelism, simd
verifier-only
~50% smaller binary, perfect for polkavm/on-chain verification
no_std verifier
minimal build for wasm/embedded (requires alloc)
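the commands below sketch how these builds are typically selected with cargo; the feature names come from the features list further down, but the exact combinations are assumptions:

```sh
# full-featured (default)
cargo build --release

# verifier-only (~50% smaller binary; assumed feature selection)
cargo build --release --no-default-features --features verifier-only,std

# no_std verifier for wasm/embedded (assumed feature selection; requires alloc)
cargo build --release --no-default-features --features verifier-only
```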
cli usage
installation
recommended (optimized for your cpu):

```sh
# from crates.io (the cargo invocation is an assumption; only the RUSTFLAGS value is verbatim)
RUSTFLAGS="-C target-cpu=native" cargo install ligerito

# or from source
cargo build --release
```

when building from source, the workspace config automatically enables native cpu optimizations.
performance by simd tier (8 cores, SMT off):
- AVX-512 + VPCLMULQDQ: ~50ms for 2^20 prove (fastest)
- AVX2 + VPCLMULQDQ: ~65ms for 2^20 prove
- SSE + PCLMULQDQ: ~95ms for 2^20 prove
- No SIMD (scalar): ~220ms for 2^20 prove
important: disable SMT (hyperthreading) for accurate benchmarks - it causes cache contention.
without optimizations (not recommended): install without setting `RUSTFLAGS`.
check your build: look for `[release SIMD]` in the prove output to confirm optimal performance.
prove and verify
```sh
# note: the generate, prove, and verify subcommands and the --size flag are
# referenced elsewhere in this README; the exact flags and file names below
# are assumptions made for illustration.

# generate random test data (2^20 = 1M elements)
ligerito generate --size 20 > poly.bin

# generate proof from polynomial data
ligerito prove < poly.bin > proof.bin

# verify proof
ligerito verify < proof.bin
# output: "VALID" with exit code 0

# roundtrip test
ligerito generate --size 20 | ligerito prove | ligerito verify
```
transcript backends
prover and verifier must use the same transcript backend:
- sha256 (default)
- merlin (requires transcript-merlin feature)
generate test data
the generate command supports several data patterns:
- random data (default)
- all zeros
- all ones
- sequential (0, 1, 2, ...)
output can be redirected to a file.
benchmark (no I/O overhead)
```sh
# note: `ligerito bench --size 24` and RAYON_NUM_THREADS appear verbatim later
# in this README; the other flags are assumptions made for illustration.

# prove only
ligerito bench --size 24

# prove + verify
ligerito bench --size 24 --verify

# multiple iterations
ligerito bench --size 24 --iterations 10

# with specific thread count
RAYON_NUM_THREADS=16 ligerito bench --size 24
```
show configuration
data format
polynomials are binary data: size * 4 bytes (4 bytes per `BinaryElem32` element).
example for 2^12: 4,096 elements * 4 bytes = 16,384 bytes.
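as a sketch of producing a well-formed input file (little-endian encoding of each 4-byte element is an assumption not stated above):

```rust
use std::fs::File;
use std::io::{BufWriter, Write};

// write 2^12 elements, 4 bytes each, as raw binary (little-endian assumed)
fn main() -> std::io::Result<()> {
    let mut out = BufWriter::new(File::create("poly_12.bin")?);
    for i in 0u32..(1 << 12) {
        out.write_all(&i.to_le_bytes())?; // sequential pattern: 0, 1, 2, ...
    }
    Ok(())
}
```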
features
- `std` (default): standard library support
- `prover` (default): include proving functionality
- `verifier-only`: minimal verifier build
- `parallel` (default): multi-threaded with rayon
- `hardware-accel` (default): simd acceleration
- `transcript-sha256`: sha256 transcript (always available)
- `transcript-merlin`: merlin transcript (requires std)
- `cli`: command-line binary
- `wasm`: browser support with wasm-bindgen
- `wasm-parallel`: multi-threaded WASM via Web Workers (requires SharedArrayBuffer)
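for example, a dependency declaration for a minimal verifier-only build might look like this (the exact feature combination is an assumption; see the list above):

```toml
[dependencies]
ligerito = { version = "0.3.1", default-features = false, features = ["verifier-only"] }
```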
wasm usage
```sh
# build WASM with SIMD128 (recommended)
# (the cargo invocation and feature selection are assumptions; only the
#  RUSTFLAGS values are verbatim)
RUSTFLAGS='-C target-feature=+simd128' \
  cargo build --release --target wasm32-unknown-unknown --features wasm

# with parallel support (requires SharedArrayBuffer + CORS headers)
RUSTFLAGS='-C target-feature=+simd128,+atomics,+bulk-memory,+mutable-globals' \
  cargo build --release --target wasm32-unknown-unknown --features wasm-parallel
```
wasm simd128 optimizations:
- `i8x16_swizzle` for parallel 16-way table lookups
- `v128_xor` for SIMD GF(2) additions
- 4-bit lookup table with Karatsuba decomposition
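a minimal sketch of the swizzle-based lookup pattern (illustrative only, not the crate's actual kernel; the function and variable names are hypothetical):

```rust
#[cfg(target_arch = "wasm32")]
use core::arch::wasm32::{i8x16_swizzle, u8x16_splat, v128, v128_and, v128_xor};

// look up 16 precomputed low-nibble products in parallel and fold them into an
// accumulator with a GF(2) addition (xor); build with -C target-feature=+simd128
#[cfg(target_arch = "wasm32")]
fn lookup_low_nibbles(table: v128, bytes: v128, acc: v128) -> v128 {
    let low_nibbles = v128_and(bytes, u8x16_splat(0x0f));
    v128_xor(acc, i8x16_swizzle(table, low_nibbles))
}
```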
supported sizes
- 2^12 (4,096 elements, 16 KB)
- 2^16 (65,536 elements, 256 KB)
- 2^20 (1,048,576 elements, 4 MB)
- 2^24 (16,777,216 elements, 64 MB)
- 2^28 (268,435,456 elements, 1 GB)
- 2^30 (1,073,741,824 elements, 4 GB)
performance
benchmark results
benchmarked on amd ryzen 9 7945hx (16 cores / 32 threads, AVX-512 + VPCLMULQDQ):
| size | elements | prove time | verify time | proof size | throughput |
|---|---|---|---|---|---|
| 2^20 | 1.05m | 41ms | 5ms | 149 KB | 25.6m elem/s |
| 2^24 | 16.8m | 570ms | 29ms | 245 KB | 29.4m elem/s |
| 2^28 | 268.4m | ~9s | ~400ms | ~2.4 MB | 29.8m elem/s |
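throughput is simply elements divided by prove time, e.g. 1,048,576 elements / 0.041 s ≈ 25.6M elem/s for the 2^20 row.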
use the built-in benchmark command (no I/O overhead), e.g. `ligerito bench --size 24`.
parallel scaling analysis
the prover is memory-bandwidth bound, not compute-bound. scaling efficiency drops significantly beyond 8 cores:
| threads | 2^24 prove | speedup | efficiency |
|---|---|---|---|
| 1 | 4339ms | 1.0x | 100% |
| 2 | 2301ms | 1.9x | 94% |
| 4 | 1251ms | 3.5x | 87% |
| 8 | 780ms | 5.6x | 70% |
| 16 | 617ms | 7.0x | 44% |
| 32 | 570ms | 7.6x | 24% |
why scaling stops at ~8 cores:
- memory bandwidth saturation: 2^24 elements = 64MB polynomial. the FFT reads/writes this data multiple times per level. DDR5 bandwidth (~100GB/s) becomes the bottleneck, not CPU cycles.
- L3 cache pressure: ryzen 7945hx has 64MB L3 (32MB per CCD). the polynomial barely fits in cache. with >8 cores, cross-CCD traffic over infinity fabric adds latency.
- amdahl's law: FFT has log2(n) = 24 levels. only some levels run in parallel. merkle tree construction and transcript hashing have limited parallelism.
- diminishing returns: 7.6x speedup on 32 threads (vs theoretical 32x) is actually good for memory-bound workloads. typical memory-bound algorithms achieve 4-8x on many-core systems.
simd tier comparison
fft butterfly performance (2^24 elements, single iteration):
| tier | instruction | elements/op | time | vs SSE |
|---|---|---|---|---|
| AVX-512 | vpclmulqdq zmm | 8 | 17ms | 2.6x |
| AVX2 | vpclmulqdq ymm | 4 | 25ms | 1.8x |
| SSE | pclmulqdq xmm | 2 | 45ms | baseline |
runtime detection automatically selects the best available tier.
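runtime detection typically looks like the sketch below (illustrative only; the crate's actual dispatch code may differ):

```rust
// pick the widest carryless-multiply tier the CPU supports at runtime
#[cfg(target_arch = "x86_64")]
fn simd_tier() -> &'static str {
    if is_x86_feature_detected!("avx512f") && is_x86_feature_detected!("vpclmulqdq") {
        "AVX-512 + VPCLMULQDQ"
    } else if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("vpclmulqdq") {
        "AVX2 + VPCLMULQDQ"
    } else if is_x86_feature_detected!("pclmulqdq") {
        "SSE + PCLMULQDQ"
    } else {
        "scalar fallback"
    }
}
```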
cli vs library performance
the CLI has ~2x overhead when using pipes due to I/O:
| method | 2^24 prove |
|---|---|
| `ligerito bench --size 24` | 570ms |
| `generate \| prove \| verify` pipeline | ~1200ms |
the overhead comes from:
- piping 64MB through stdin/stdout
- random number generation in `generate`
- proof serialization/deserialization

for benchmarking, always use `ligerito bench`, which measures pure algorithm performance.
optimal configuration
```sh
# note: only the RAYON_NUM_THREADS values are verbatim from this README; the
# bench invocation and the SMT toggle are assumptions made for illustration.

# best performance: all threads, SMT on
RAYON_NUM_THREADS=32 ligerito bench --size 24

# consistent benchmarks: physical cores only, SMT off
echo off | sudo tee /sys/devices/system/cpu/smt/control
RAYON_NUM_THREADS=16 ligerito bench --size 24
echo on | sudo tee /sys/devices/system/cpu/smt/control

# single CCD (lower latency, less throughput)
RAYON_NUM_THREADS=8 ligerito bench --size 24
```
improving performance further
current bottlenecks and potential solutions:
| bottleneck | solution | expected gain |
|---|---|---|
| memory bandwidth | GPU acceleration (HBM) | 5-10x |
| FFT memory access | cache-blocked FFT | 1.2-1.5x |
| field element size | GF(2^16) instead of GF(2^32) | 1.5-2x |
| cross-CCD latency | NUMA-aware allocation | 1.1-1.2x |
reference
Ligerito paper by Andrija Novakovic and Guillermo Angeris
license
mit / apache-2.0