# burn_p2p 🔥🤝
burn_p2p turns a burn learner into a decentralized training network.
core shape:
- trainers lease shard slices, sync the latest visible canonical head, run one train window, then publish candidate updates
- reducers are non-authoritative proposal builders; they can reduce early, but validators independently rescreen the cohort and can locally re-reduce if reducer output is missing or wrong
- only validator quorum can attest a reduction, issue merge/quorum certificates, and advance the canonical head
- quarantined or revoked peers are filtered out of future trainer, reducer, and validator pools
- cheap bootstrap/coherence seeds handle ingress, discovery, relay fallback, and browser-edge http state
- the same network can include native peers, browser peers, viewers, reducers, validators, and trainer pools on different hardware classes
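the core shape above can be sketched as a toy, self-contained control flow. all names here (`Head`, `CandidateUpdate`, `reduce`, `attest_and_promote`, the quorum threshold) are illustrative stand-ins, not the crate's real API:

```rust
// toy sketch of one training window: trainers publish candidates, a
// reducer builds a non-authoritative proposal, validators rescreen the
// cohort, and only quorum attestation advances the canonical head.

#[derive(Clone, Copy, Debug, PartialEq)]
struct Head(u64); // canonical head; moves only on validator quorum

#[derive(Clone, Copy)]
struct CandidateUpdate {
    trainer_id: u32,
    delta: f64, // candidate input, not canonical state
}

// reducer-side proposal: a plain average over the cohort (illustrative)
fn reduce(cohort: &[CandidateUpdate]) -> f64 {
    cohort.iter().map(|c| c.delta).sum::<f64>() / cohort.len() as f64
}

// each validator independently recomputes the reduction and attests only
// if it matches the proposal; the head advances only at quorum
fn attest_and_promote(
    head: Head,
    cohort: &[CandidateUpdate],
    proposal: f64,
    validators: usize,
    quorum: usize,
) -> Head {
    let attestations = (0..validators)
        .filter(|_| (reduce(cohort) - proposal).abs() < 1e-9)
        .count();
    if attestations >= quorum { Head(head.0 + 1) } else { head }
}
```

a matching proposal gathers attestations and the head moves forward one window; a mismatched proposal gathers none and the head stays put, which is the whole trust split in miniature.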
## install

```toml
[dependencies]
burn_p2p = { version = "=0.21.0-pre.8", features = ["burn"] }
```
## happy path

```rust
// reconstructed sketch: argument lists and exact import paths were
// elided from this snippet; fill in your own burn model, optimizer,
// scheduler, loaders, and network/storage/bootstrap configuration.
use burn_p2p::Learner;      // import path elided in the original
use burn_p2p::from_loaders; // import path elided in the original

let mut trainer = from_loaders(/* model, optim, scheduler, loaders */)
    .trainer(/* ... */)?
    .with_network(/* ... */)?
    .with_storage(/* ... */)
    .with_bootstrap_peer(/* ... */)
    .spawn()?;

let experiment = trainer.experiment();
let mut session = trainer.continuous_trainer()?;
let outcome = session.train_next_window()?;
println!("{outcome:?}");
```
keep your existing burn model, optimizer, scheduler, and loaders.
use train_window_once(...) instead when you want a single strictly
orchestrated training window with no retained session state.
burn_p2p handles:
- head sync
- lease-scoped shard assignment
- window-by-window training publication
- checkpoint/artifact movement
- reducer proposal flow
- validator attestation and promotion
- peer discovery, relay fallback, and control-plane sync
## safety boundary
the important trust split is:
- trainer updates are candidate inputs, not canonical state
- reducer output is a proposal, not canonical state
- validator quorum is the authority boundary for canonical promotion
that means:
- a bad trainer can be rejected, downweighted, quarantined, or left out of the merge
- a bad reducer can waste time or bandwidth, but validators still recompute the expected accepted cohort locally before promotion
- if a dedicated reducer is silent or serves a mismatched aggregate, validators fall back to local reduction instead of letting the reducer stall the window
- a canonical head only moves after validator attestation and merge promotion
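the reducer-fallback rule above fits in a few lines. `expected_aggregate` and `aggregate_for_window` are assumed names for this sketch, not the crate's real API:

```rust
// validator-side fallback: accept the reducer's aggregate only when it
// matches what the validator recomputes locally; otherwise re-reduce
// locally so a silent or wrong reducer cannot stall the window.

// local recomputation of the expected accepted cohort's aggregate
fn expected_aggregate(cohort: &[f64]) -> f64 {
    cohort.iter().sum()
}

fn aggregate_for_window(cohort: &[f64], reducer_output: Option<f64>) -> f64 {
    let local = expected_aggregate(cohort);
    match reducer_output {
        // reducer output matches the local recomputation: use it
        Some(agg) if (agg - local).abs() < 1e-9 => agg,
        // missing or mismatched: fall back to local reduction
        _ => local,
    }
}
```

this is why a bad reducer can only waste time or bandwidth: every path out of `aggregate_for_window` is consistent with the validator's own recomputation.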
the main operating guidance for adversarial or semi-trusted deployments is:
- keep reducers and validators as separate roles
- run more than one validator; `validator_quorum = 1` is a lab mode, not a safe default
- use admission/auth for untrusted membership
- treat reducers as optional acceleration, not as authorities
this is a decentralized training protocol, not a full bft ledger. if validator quorum is compromised, canonical safety is compromised too.
most deployments should separate:
- cheap bootstrap/coherence seeds
- reducer nodes
- validator / authority nodes
- trainer pools
the reference deploy shape follows that split directly:
- bootstrap seeds are public ingress/discovery nodes
- validators and reducers stay private by default
- browser edge is an explicit opt-in profile, not the baseline split-fleet role
for trainer nodes, from_loaders(...) is still the main public entrypoint. use
from_learner(...) for reducer, validator, viewer, and helper-style runtime
roles.
## data
one lease is one micro-epoch. that is the unit that drives publish cadence and canonical reconcile.
use with_sharded_dataset(...) when data already lives as prepared shard
files.
use LeaseDataPipeline<Device, Batch> when batches should be rebuilt from
indices, samplers, seeds, recipes, or custom lease metadata.
pipeline kinds stay simple:
- ShardedStatic: shard files
- IndexedDataset: dataset + sampler scope
- GeneratedDataset: deterministic generation
- Custom: anything else
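as a plain enum, the four kinds could look like this. the variant payloads here are illustrative, not the crate's real field layout:

```rust
// the four pipeline kinds as a simple enum; payload fields are assumptions
enum PipelineKind {
    ShardedStatic { shard_files: Vec<String> },          // prepared shard files
    IndexedDataset { sampler_scope: std::ops::Range<usize> }, // dataset + sampler scope
    GeneratedDataset { seed: u64 },                      // deterministic generation
    Custom,                                              // anything else
}

fn describe(kind: &PipelineKind) -> &'static str {
    match kind {
        PipelineKind::ShardedStatic { .. } => "shard files",
        PipelineKind::IndexedDataset { .. } => "dataset + sampler scope",
        PipelineKind::GeneratedDataset { .. } => "deterministic generation",
        PipelineKind::Custom => "anything else",
    }
}
```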
burn uses .with_data_pipeline(...). python/torch uses
PythonTorchProject::new_with_data_pipeline(...).
both adapters expose the same inspection surface:
- data_pipeline_descriptor()
- data_pipeline_kind()
- local_upstream_root()
local_upstream_root() only returns Some(...) for local shard-backed
pipelines.
native peers exchange control-plane state, heads, checkpoints, and artifacts over the peer network.
browser peers fetch only the active lease-scoped shard data through the browser
edge. today that path is peer-backed (p2p-artifact-via-edge): native peers
sync the prepared shard bundle over the overlay, and the edge serves only the
leased slice to the browser.
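the "serve only the leased slice" idea reduces to a bounds-checked slice over the synced bundle. `Lease` and `leased_slice` are illustrative names for this sketch, not the edge's real API:

```rust
// p2p-artifact-via-edge sketch: the edge holds the full prepared shard
// bundle (synced over the overlay) but hands a browser peer only the
// byte range covered by its active lease.

struct Lease {
    start: usize,
    len: usize,
}

fn leased_slice<'a>(bundle: &'a [u8], lease: &Lease) -> Option<&'a [u8]> {
    // an out-of-range or overflowing lease yields nothing, never the
    // whole bundle
    let end = lease.start.checked_add(lease.len)?;
    bundle.get(lease.start..end)
}
```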
## what the repo includes
- burn_p2p: core runtime, burn-facing facade, training, validation, and promotion flow
- burn_p2p_workload: backend-neutral workload and lease-data-pipeline seam
- burn_p2p_python: subprocess-backed python/torch workload adapter
- burn_p2p_limits: capability probing and role/budget heuristics for native and browser peers
- burn_p2p_swarm: native transport, discovery, relay/rendezvous integration, and control-plane event model
- burn_p2p_bootstrap: coherence-seed, reducer/validator deployment surface, and browser-edge http/admin surface
- burn_p2p_browser: browser runtime bridge and wasm-facing transport glue
- burn_p2p_app: reference dioxus app and browser-edge product surface
- burn_p2p_publish: artifact export and publication surfaces
- examples/mnist_p2p_demo: real downstream-style mixed-fleet demo used by cargo xtask e2e mnist
- examples/torch_mnist_p2p_demo: python/torch subprocess-backed mnist demo using the same p2p runtime
same experiment layout works across native and browser peers. browser-facing runtime and ui live in the companion crates above.
## see it working

single-machine mixed-fleet mnist sanity run:

```shell
cargo xtask e2e mnist
```
best follow-up docs:
- docs/examples/mnist.md
- docs/downstream-burn-guide.md
- docs/learning-dynamics.md
- docs/protocol-shape.md
- docs/formal-verification-plan.md
- docs/production-roadmap.md
- deploy/README.md
- docs/operator-runbook.md
- docs/features.md
non-burn runtime? implement P2pWorkload directly.
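what "implement P2pWorkload directly" could look like in outline; the real trait lives in burn_p2p_workload and its methods will differ, so `train_window` and `export_checkpoint` below are placeholder signatures for illustration only:

```rust
// hypothetical shape of a custom workload: one method to run a training
// window, one to export a checkpoint artifact. not the real trait.
trait P2pWorkload {
    fn train_window(&mut self, window: u64) -> f64; // placeholder signature
    fn export_checkpoint(&self) -> Vec<u8>;         // placeholder signature
}

// toy non-burn runtime implementing the hypothetical trait
struct ToyWorkload {
    loss: f64,
}

impl P2pWorkload for ToyWorkload {
    fn train_window(&mut self, _window: u64) -> f64 {
        // pretend one window of training improved the loss
        self.loss *= 0.9;
        self.loss
    }

    fn export_checkpoint(&self) -> Vec<u8> {
        // serialize just enough state to resume (here: the loss itself)
        self.loss.to_le_bytes().to_vec()
    }
}
```

the point of the seam is that anything window-steppable and checkpointable can ride the same lease, publish, reduce, and promote machinery that the burn adapter uses.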