Ligerito
polynomial commitment scheme over binary extension fields.
⚠️ IMPORTANT: For optimal performance (3x speedup), install with native CPU optimizations: `RUSTFLAGS="-C target-cpu=native"`
Without this flag, the prover will be significantly slower. See installation for details.
what it's good for
- committing to large polynomials with small proofs (~150 KB for 2^20 polynomial)
- fast proving on modern cpus with simd (50ms for 1M elements on AVX-512, SMT off)
- tiered simd: AVX-512 → AVX2 → SSE → scalar fallback
- verifier-only builds for constrained environments (polkavm, wasm, embedded)
- transparent setup (no trusted setup required)
- enabling verifiable light client p2p networks
what it's not good for
- general-purpose zkp (no arbitrary circuits, only polynomial commitments)
- proving without simd (slow without hardware acceleration)
- tiny polynomials (proof overhead significant below 2^12)
- scenarios requiring smallest possible proofs (starkware/plonky2 may be smaller)
library usage
add to Cargo.toml:
```toml
[dependencies]
ligerito = "0.3.1"
```
⚠️ for development: clone the workspace to get automatic native optimizations.
example:

```rust
// note: import paths and exact signatures below are assumptions made for
// illustration; consult the crate documentation for the precise API.
use ligerito::{hardcoded_config_20, hardcoded_config_20_verifier, prove, verify, BinaryElem32};
use std::marker::PhantomData;

// create prover config
let config = hardcoded_config_20(PhantomData::<BinaryElem32>);

// polynomial to commit (2^20 elements)
let poly: Vec<BinaryElem32> = vec![BinaryElem32::from(1u32); 1 << 20];

// generate proof
let proof = prove(&config, &poly).unwrap();

// verify proof
let verifier_config = hardcoded_config_20_verifier(PhantomData::<BinaryElem32>);
let valid = verify(&verifier_config, &proof).unwrap();
assert!(valid);
```
transcript backends

```rust
// note: import paths and call signatures are assumptions; the function names
// prove_sha256 / verify_sha256 and prove / verify are as used in this README.

// sha256 (default, no extra deps, works in no_std)
use ligerito::{prove_sha256, verify_sha256};
let proof = prove_sha256(&config, &poly).unwrap();
let valid = verify_sha256(&verifier_config, &proof).unwrap();

// merlin (requires std + transcript-merlin feature)
use ligerito::{prove, verify};
let proof = prove(&config, &poly).unwrap();
let valid = verify(&verifier_config, &proof).unwrap();
```
supported sizes
configs available: `hardcoded_config_{12,16,20,24,28,30}` for the prover and `hardcoded_config_{12,16,20,24,28,30}_verifier` for the verifier.
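pick the config that matches the polynomial size. a minimal sketch for a 2^24-element polynomial, under the same assumed signatures as the example above:

```rust
// hypothetical usage; the PhantomData-based signature is an assumption
let config = hardcoded_config_24(PhantomData::<BinaryElem32>);
let verifier_config = hardcoded_config_24_verifier(PhantomData::<BinaryElem32>);
```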
build configurations
full-featured (default)
includes: prover, verifier, parallelism, simd
verifier-only
~50% smaller binary, perfect for polkavm/on-chain verification
no_std verifier
minimal build for wasm/embedded (requires alloc)
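the commands below sketch how these builds are typically selected with cargo; the feature names come from the features list further down, but the exact combinations are assumptions:

```sh
# full-featured (default)
cargo build --release

# verifier-only (~50% smaller binary; assumed feature selection)
cargo build --release --no-default-features --features verifier-only,std

# no_std verifier for wasm/embedded (assumed feature selection; requires alloc)
cargo build --release --no-default-features --features verifier-only
```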
cli usage
installation
recommended (optimized for your cpu):

```sh
# from crates.io (the cargo invocation is an assumption; only the RUSTFLAGS value is verbatim)
RUSTFLAGS="-C target-cpu=native" cargo install ligerito

# or from source
cargo build --release
```

when building from source, the workspace config automatically enables native cpu optimizations.
performance by simd tier (8 cores, SMT off):
- AVX-512 + VPCLMULQDQ: ~50ms for 2^20 prove (fastest)
- AVX2 + VPCLMULQDQ: ~65ms for 2^20 prove
- SSE + PCLMULQDQ: ~95ms for 2^20 prove
- No SIMD (scalar): ~220ms for 2^20 prove
important: disable SMT (hyperthreading) for accurate benchmarks - it causes cache contention.
without optimizations (not recommended): install without setting `RUSTFLAGS`.
check your build: look for `[release SIMD]` in the prove output to confirm optimal performance.
prove and verify
```sh
# note: the generate, prove, and verify subcommands and the --size flag are
# referenced elsewhere in this README; the exact flags and file names below
# are assumptions made for illustration.

# generate random test data (2^20 = 1M elements)
ligerito generate --size 20 > poly.bin

# generate proof from polynomial data
ligerito prove < poly.bin > proof.bin

# verify proof
ligerito verify < proof.bin
# output: "VALID" with exit code 0

# roundtrip test
ligerito generate --size 20 | ligerito prove | ligerito verify
```
transcript backends
prover and verifier must use the same transcript backend:
- sha256 (default)
- merlin (requires transcript-merlin feature)
generate test data
the generate command supports several data patterns:
- random data (default)
- all zeros
- all ones
- sequential (0, 1, 2, ...)
output can be redirected to a file.
benchmark (no I/O overhead)
```sh
# note: `ligerito bench --size 24` and RAYON_NUM_THREADS appear verbatim later
# in this README; the other flags are assumptions made for illustration.

# prove only
ligerito bench --size 24

# prove + verify
ligerito bench --size 24 --verify

# multiple iterations
ligerito bench --size 24 --iterations 10

# with specific thread count
RAYON_NUM_THREADS=16 ligerito bench --size 24
```
show configuration
data format
polynomials are binary data: size * 4 bytes (4 bytes per `BinaryElem32` element).
example for 2^12: 4,096 elements * 4 bytes = 16,384 bytes.
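as a sketch of producing a well-formed input file (little-endian encoding of each 4-byte element is an assumption not stated above):

```rust
use std::fs::File;
use std::io::{BufWriter, Write};

// write 2^12 elements, 4 bytes each, as raw binary (little-endian assumed)
fn main() -> std::io::Result<()> {
    let mut out = BufWriter::new(File::create("poly_12.bin")?);
    for i in 0u32..(1 << 12) {
        out.write_all(&i.to_le_bytes())?; // sequential pattern: 0, 1, 2, ...
    }
    Ok(())
}
```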
features
- `std` (default): standard library support
- `prover` (default): include proving functionality
- `verifier-only`: minimal verifier build
- `parallel` (default): multi-threaded with rayon
- `hardware-accel` (default): simd acceleration
- `transcript-sha256`: sha256 transcript (always available)
- `transcript-merlin`: merlin transcript (requires std)
- `cli`: command-line binary
- `wasm`: browser support with wasm-bindgen
- `wasm-parallel`: multi-threaded WASM via Web Workers (requires SharedArrayBuffer)
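for example, a dependency declaration for a minimal verifier-only build might look like this (the exact feature combination is an assumption; see the list above):

```toml
[dependencies]
ligerito = { version = "0.3.1", default-features = false, features = ["verifier-only"] }
```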
wasm usage
```sh
# build WASM with SIMD128 (recommended)
# (the cargo invocation and feature selection are assumptions; only the
#  RUSTFLAGS values are verbatim)
RUSTFLAGS='-C target-feature=+simd128' \
  cargo build --release --target wasm32-unknown-unknown --features wasm

# with parallel support (requires SharedArrayBuffer + CORS headers)
RUSTFLAGS='-C target-feature=+simd128,+atomics,+bulk-memory,+mutable-globals' \
  cargo build --release --target wasm32-unknown-unknown --features wasm-parallel
```
wasm simd128 optimizations:
- `i8x16_swizzle` for parallel 16-way table lookups
- `v128_xor` for SIMD GF(2) additions
- 4-bit lookup table with Karatsuba decomposition
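a minimal sketch of the swizzle-based lookup pattern (illustrative only, not the crate's actual kernel; the function and variable names are hypothetical):

```rust
#[cfg(target_arch = "wasm32")]
use core::arch::wasm32::{i8x16_swizzle, u8x16_splat, v128, v128_and, v128_xor};

// look up 16 precomputed low-nibble products in parallel and fold them into an
// accumulator with a GF(2) addition (xor); build with -C target-feature=+simd128
#[cfg(target_arch = "wasm32")]
fn lookup_low_nibbles(table: v128, bytes: v128, acc: v128) -> v128 {
    let low_nibbles = v128_and(bytes, u8x16_splat(0x0f));
    v128_xor(acc, i8x16_swizzle(table, low_nibbles))
}
```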
supported sizes
- 2^12 (4,096 elements, 16 KB)
- 2^16 (65,536 elements, 256 KB)
- 2^20 (1,048,576 elements, 4 MB)
- 2^24 (16,777,216 elements, 64 MB)
- 2^28 (268,435,456 elements, 1 GB)
- 2^30 (1,073,741,824 elements, 4 GB)
performance
benchmark results
benchmarked on amd ryzen 9 7945hx (16 cores / 32 threads, AVX-512 + VPCLMULQDQ):
| size | elements | prove time | verify time | proof size | throughput |
|---|---|---|---|---|---|
| 2^20 | 1.05m | 41ms | 5ms | 149 KB | 25.6m elem/s |
| 2^24 | 16.8m | 570ms | 29ms | 245 KB | 29.4m elem/s |
| 2^28 | 268.4m | ~9s | ~400ms | ~2.4 MB | 29.8m elem/s |
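throughput is simply elements divided by prove time, e.g. 1,048,576 elements / 0.041 s ≈ 25.6M elem/s for the 2^20 row.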
use the built-in benchmark command (no I/O overhead), e.g. `ligerito bench --size 24`.
parallel scaling analysis
the prover is memory-bandwidth bound, not compute-bound. scaling efficiency drops significantly beyond 8 cores:
| threads | 2^24 prove | speedup | efficiency |
|---|---|---|---|
| 1 | 4339ms | 1.0x | 100% |
| 2 | 2301ms | 1.9x | 94% |
| 4 | 1251ms | 3.5x | 87% |
| 8 | 780ms | 5.6x | 70% |
| 16 | 617ms | 7.0x | 44% |
| 32 | 570ms | 7.6x | 24% |
why scaling stops at ~8 cores:
- memory bandwidth saturation: 2^24 elements = 64MB polynomial. the FFT reads/writes this data multiple times per level. DDR5 bandwidth (~100GB/s) becomes the bottleneck, not CPU cycles.
- L3 cache pressure: ryzen 7945hx has 64MB L3 (32MB per CCD). the polynomial barely fits in cache. with >8 cores, cross-CCD traffic over infinity fabric adds latency.
- amdahl's law: FFT has log2(n) = 24 levels. only some levels run in parallel. merkle tree construction and transcript hashing have limited parallelism.
- diminishing returns: 7.6x speedup on 32 threads (vs theoretical 32x) is actually good for memory-bound workloads. typical memory-bound algorithms achieve 4-8x on many-core systems.
simd tier comparison
fft butterfly performance (2^24 elements, single iteration):
| tier | instruction | elements/op | time | vs SSE |
|---|---|---|---|---|
| AVX-512 | vpclmulqdq zmm | 8 | 17ms | 2.6x |
| AVX2 | vpclmulqdq ymm | 4 | 25ms | 1.8x |
| SSE | pclmulqdq xmm | 2 | 45ms | baseline |
runtime detection automatically selects the best available tier.
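runtime detection typically looks like the sketch below (illustrative only; the crate's actual dispatch code may differ):

```rust
// pick the widest carryless-multiply tier the CPU supports at runtime
#[cfg(target_arch = "x86_64")]
fn simd_tier() -> &'static str {
    if is_x86_feature_detected!("avx512f") && is_x86_feature_detected!("vpclmulqdq") {
        "AVX-512 + VPCLMULQDQ"
    } else if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("vpclmulqdq") {
        "AVX2 + VPCLMULQDQ"
    } else if is_x86_feature_detected!("pclmulqdq") {
        "SSE + PCLMULQDQ"
    } else {
        "scalar fallback"
    }
}
```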
cli vs library performance
the CLI has ~2x overhead when using pipes due to I/O:
| method | 2^24 prove |
|---|---|
| `ligerito bench --size 24` | 570ms |
| `generate \| prove \| verify` pipeline | ~1200ms |
the overhead comes from:
- piping 64MB through stdin/stdout
- random number generation in `generate`
- proof serialization/deserialization

for benchmarking, always use `ligerito bench`, which measures pure algorithm performance.
optimal configuration
```sh
# note: only the RAYON_NUM_THREADS values are verbatim from this README; the
# bench invocation and the SMT toggle are assumptions made for illustration.

# best performance: all threads, SMT on
RAYON_NUM_THREADS=32 ligerito bench --size 24

# consistent benchmarks: physical cores only, SMT off
echo off | sudo tee /sys/devices/system/cpu/smt/control
RAYON_NUM_THREADS=16 ligerito bench --size 24
echo on | sudo tee /sys/devices/system/cpu/smt/control

# single CCD (lower latency, less throughput)
RAYON_NUM_THREADS=8 ligerito bench --size 24
```
improving performance further
current bottlenecks and potential solutions:
| bottleneck | solution | expected gain |
|---|---|---|
| memory bandwidth | GPU acceleration (HBM) | 5-10x |
| FFT memory access | cache-blocked FFT | 1.2-1.5x |
| field element size | GF(2^16) instead of GF(2^32) | 1.5-2x |
| cross-CCD latency | NUMA-aware allocation | 1.1-1.2x |
reference
Ligerito paper by Andrija Novakovic and Guillermo Angeris
license
mit / apache-2.0