bytesandbrains 0.3.4

Composable building blocks for decentralized + federated machine learning.
# Roles — the Contract reference

This document describes the role traits that library authors implement
when shipping a concrete component for the `bytesandbrains` framework.
The **user-facing surface** is the Contract trait family in
[`bb-runtime/src/contracts/`](../bb-runtime/src/contracts/):
`bb::Aggregator`, `bb::Backend`, `bb::Codec`, `bb::DataSource`,
`bb::Index`, `bb::Model`, `bb::PeerSelector`, plus the `Protocol` slot.
Authors implement Contract methods on their concrete struct and pair
the impl with `#[derive(bb::<Role>)]` (or, for protocols, the
`register_protocol!{}` declarative macro). The framework-internal
`<Role>Runtime` traits in `bb-runtime/src/roles/` are bridges the
derive emits — they are not part of the public authoring surface.

## Part 1 — Overview

Each Contract is a small trait carrying a `type Error`, one method
per operation the role surfaces, and (for `bb::Aggregator`) an
associated `type Metadata` that travels alongside the tensor. Every
method takes the same trio (`Backend` excepted — see Part 4):

1. `ctx: &mut RuntimeResourceRef<'_>` — the engine's per-dispatch
   runtime surface. Impls reach declared `#[depends(<role> =
   "<slot>")]` siblings through `ctx.dependency::<T>("<slot>")`;
   `PeerSelector` additionally walks the local `AddressBook` via
   `ctx.peers.addresses` and plans delayed work via `ctx.time`.
2. The op's typed inputs.
3. `completion: CompletionHandle<R, Self::Error>`.

The method returns `ContractResponse<R, Self::Error>` (inline
`Now` vs deferred `Later`).

Every tensor-carrying Contract — `Index`, `Aggregator`, `Model`,
`DataSource`, `Backend`, `Codec` — carries a `Storage`-bound
associated type that declares where in the tensor-type tree this
concrete sits (`TYPE_TENSOR_F32`, `TYPE_TENSOR_U8`, the generic
`TYPE_TENSOR` root, or any user-registered leaf). The compiler reads
`Storage::TYPE` at bind time to stamp `value_info` denotations; the
type solver walks the graph and refuses unbridged mismatches at
compile time. See [`docs/TYPES.md`](TYPES.md) for the `Storage`
trait and `AnyTensor` definitions.

Component authors do not emit `NodeProto`s directly. The DSL surface
that records role calls into a `GraphProto` lives in the **typed
placeholders** under
[`bb-ops/src/placeholders/mod.rs`](../bb-ops/src/placeholders/mod.rs):
the `Backend`, `Model`, `Index`, `Aggregator`, `Codec`,
`DataLoader`, `PeerSelector`, and `Protocol` unit structs. Each
placeholder method records a `NodeProto` under the role's opset
domain (`ai.bytesandbrains.role.<name>`, or `ai.onnx` for Backend)
and stamps `(required_trait, slot_id)` metadata so the compiler can
bind the call to the right concrete impl. Library authors that ship
their own concrete type expose its DSL methods as inherent methods
on the type; those follow the same recording shape.

## Part 2 — The Common Contract Shape

Every Contract method follows the same skeleton:

```rust
fn op(
    &mut self,                                  // (or &self for read-only ops)
    ctx: &mut RuntimeResourceRef<'_>,           // per-dispatch runtime surface
    args: …,                                    // typed args
    completion: CompletionHandle<R, Self::Error>,
) -> ContractResponse<R, Self::Error>;
```

- `ctx` exposes `ctx.dependency::<T>("<slot>")` for declared
  `#[depends(...)]` siblings, `ctx.open_completion::<R, E>()` for
  minting fresh completion handles, plus the address book / scheduler
  / bus surface used by selectors and protocols.
- `ContractResponse::Now(Ok(value))` — the result is ready inline.
  The framework returns
  `DispatchResult::Immediate(vec![(port, Box::new(value) as Box<dyn SlotValue>)])`,
  skips the park / ingress-drain cycle, and ignores `completion`.
  `value` is dropped into the slot table as `Box<dyn SlotValue>` with
  no serialization at this boundary; downstream ops downcast via
  `as_any`.
- `ContractResponse::Now(Err(e))` — the call failed synchronously.
  The error is propagated as the dispatch error.
- `ContractResponse::Later` — the impl retained `completion` (handed
  it to a worker thread, spawned a task, queued a remote RPC). The
  framework returns `DispatchResult::Async(handle.cmd_id())` and
  parks the dispatched op until `completion.complete(result)` arrives
  off-thread.
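
The `Now` / `Later` split can be sketched in isolation. The types below are simplified stand-ins written for this example (a channel-backed handle and a two-variant response enum), not the definitions in `bb-runtime`; `fast_square`, `slow_square`, `open_completion`, and `run` are hypothetical names.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

/// Stand-in for the framework's response type (assumption: simplified).
pub enum ContractResponse<R, E> {
    Now(Result<R, E>),
    Later,
}

/// Stand-in completion handle backed by a std channel (assumption).
pub struct CompletionHandle<R, E> {
    tx: Sender<Result<R, E>>,
}

impl<R, E> CompletionHandle<R, E> {
    pub fn complete(self, result: Result<R, E>) {
        // The receiver may already be gone; this sketch ignores that case.
        let _ = self.tx.send(result);
    }
}

pub fn open_completion<R, E>() -> (CompletionHandle<R, E>, Receiver<Result<R, E>>) {
    let (tx, rx) = channel();
    (CompletionHandle { tx }, rx)
}

type Op = fn(i64, CompletionHandle<i64, String>) -> ContractResponse<i64, String>;

/// Inline path: the value is ready, so `completion` is dropped unused.
pub fn fast_square(x: i64, _completion: CompletionHandle<i64, String>) -> ContractResponse<i64, String> {
    ContractResponse::Now(Ok(x * x))
}

/// Deferred path: the handle moves to a worker thread and the caller waits.
pub fn slow_square(x: i64, completion: CompletionHandle<i64, String>) -> ContractResponse<i64, String> {
    thread::spawn(move || completion.complete(Ok(x * x)));
    ContractResponse::Later
}

/// Drives either op to its final result, the way a dispatcher would observe it.
pub fn run(op: Op, x: i64) -> Result<i64, String> {
    let (handle, rx) = open_completion();
    match op(x, handle) {
        ContractResponse::Now(result) => result,
        ContractResponse::Later => rx.recv().map_err(|e| e.to_string())?,
    }
}
```

Both `run(fast_square, 7)` and `run(slow_square, 7)` end with the same value; only the route differs. The real engine parks the op instead of blocking on a channel, so treat the `recv` call as a placeholder for that mechanism.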

The bridge generated by `#[derive(bb::<Role>)]` wires Contract
methods into the engine's `<Role>Runtime::dispatch_atomic` entry
point — the per-Node dispatch table routes
`(domain, op_type, instance)` tuples through the bridge to the right
Contract method on the bound concrete, forwarding the bridge's `ctx`
parameter into each Contract call.

Contract methods invoked from a `Module::bootstrap` recording run
through the identical dispatch path as body-phase calls — the engine
seeds bootstrap function bodies under a fresh `ExecId`, and the
per-component `is_op_locked` gate
(`bb-runtime/src/engine/core.rs:1762-1806`) parks body-phase ops
that touch any in-flight bootstrap's `ComponentRef` touch set.
Disjoint components keep firing alongside the bootstrap. See
[ENGINE.md §6.8](ENGINE.md#68-host-driven-bootstrap-entry).

`Backend` is the lone exception: its per-op surface, `execute`, and
`dispatch` all stay `ctx`-free for borrow-checker reasons (Part 4).

### Role bindings shared across install targets

`bb::install(.., targets: &[&str], ..)` constructs **one
`ComponentRef` per slot** even when several targets declare the
same slot (`src/install.rs:524-571`). The install path walks every
target's binding entries, groups by slot name, and asserts every
contributor agrees on the same `(TYPE_NAME, role)` pair; a
disagreement surfaces `InstallError::SlotBindingConflict { slot,
conflicts }` enumerating every contributor in call order
(`src/install.rs:142-148`). Concretely:

- A federated peer hosting both `Client` and `Server` partitions
  binds `backend = "compute"` on each target. The compiler stamps
  the same `Backend|CpuBackend|<slot_id>` value under
  `binding.Client.compute` and `binding.Server.compute`; install
  constructs one `CpuBackend`, registers one `ComponentRef`, and
  both partitions' role-op dispatches route through the same
  instance.
- The same applies to `Aggregator`, `Index`, `Codec`,
  `DataSource`, `Model`, `PeerSelector`, and `Protocol` slots —
  any slot a Contract impl reaches through
  `ctx.dependency::<T>("<slot>")` is shared across every target
  declaring the slot. State mutations a Contract method makes
  (an Aggregator's contribution buffer, an Index's storage, a
  Model's optimizer state) are observable to every other target
  sharing the slot, by design.
- Targets that bind the same slot name to different concretes
  fail at install time, not at dispatch — the
  `SlotBindingConflict` walk runs before the install path
  allocates any `ComponentRef`, so a misconfigured deployment
  surfaces a typed error before any concrete is instantiated.
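
The group-then-check walk described above might look roughly like the sketch below. Everything here is hypothetical and written for illustration: `Binding`, `resolve_slots`, and the local `SlotBindingConflict` struct are not the types in `src/install.rs`, and the sketch reports the first conflicting slot in slot-name order.

```rust
use std::collections::BTreeMap;

/// One target's declaration that `slot` is bound to `type_name` under `role`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Binding {
    pub target: String,
    pub slot: String,
    pub type_name: String,
    pub role: String,
}

/// Hypothetical error mirroring the shape described in the text.
#[derive(Debug, PartialEq, Eq)]
pub struct SlotBindingConflict {
    pub slot: String,
    pub conflicts: Vec<Binding>,
}

pub fn binding(target: &str, slot: &str, type_name: &str, role: &str) -> Binding {
    Binding {
        target: target.to_string(),
        slot: slot.to_string(),
        type_name: type_name.to_string(),
        role: role.to_string(),
    }
}

/// Groups bindings by slot and yields one `(type_name, role)` per slot.
/// Every slot is validated before anything is returned, so a caller can
/// defer all instantiation until this function succeeds.
pub fn resolve_slots(
    bindings: &[Binding],
) -> Result<BTreeMap<String, (String, String)>, SlotBindingConflict> {
    let mut by_slot: BTreeMap<String, Vec<&Binding>> = BTreeMap::new();
    for b in bindings {
        by_slot.entry(b.slot.clone()).or_default().push(b);
    }
    let mut resolved = BTreeMap::new();
    for (slot, group) in by_slot {
        let first = group[0];
        let agree = group
            .iter()
            .all(|b| b.type_name == first.type_name && b.role == first.role);
        if !agree {
            // Contributors are kept in the order they were declared.
            return Err(SlotBindingConflict {
                slot,
                conflicts: group.into_iter().cloned().collect(),
            });
        }
        resolved.insert(slot, (first.type_name.clone(), first.role.clone()));
    }
    Ok(resolved)
}
```

Two targets that both bind `compute` to `CpuBackend` collapse to a single entry, which is the point at which a real install path would construct one shared instance.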

Authors who need a per-target slot identity wire two distinct slot
names: `bind_aggregator::<FedAvg>("client_aggregator")` for the
client partition's aggregator slot and
`bind_aggregator::<FedAvg>("server_aggregator")` for the server
partition's. Two slots, two `ComponentRef`s, no sharing — the
binding table addresses the slot, the dispatch table routes by the
addressed slot's `ComponentRef`.

## Part 3 — `bb::Aggregator`

A federated/decentralized aggregator. `contribute` writes one peer's
update into an in-progress buffer; `aggregate` reduces the buffer and
emits the result paired with typed metadata.

```rust
pub trait Aggregator: Send + Sync {
    /// Storage element type. Most f32-native aggregators declare
    /// `type Element = [f32]`. Use `AnyTensor` for a dtype-agnostic
    /// aggregator that delegates numeric ops to a bound `Backend`.
    type Element: ?Sized + bb_ir::types::Storage;

    type Error: std::error::Error + std::fmt::Display + Send + Sync + 'static;

    type Metadata: Clone
        + Default
        + serde::Serialize
        + for<'de> serde::Deserialize<'de>
        + Send + Sync + 'static;

    fn contribute(
        &mut self,
        ctx: &mut RuntimeResourceRef<'_>,
        src: PeerId,
        tensor: &Self::Element,
        metadata: Self::Metadata,
        completion: CompletionHandle<(), Self::Error>,
    ) -> ContractResponse<(), Self::Error>;

    fn aggregate(
        &mut self,
        ctx: &mut RuntimeResourceRef<'_>,
        completion: CompletionHandle<(Box<Self::Element>, Self::Metadata), Self::Error>,
    ) -> ContractResponse<(Box<Self::Element>, Self::Metadata), Self::Error>;
}
```

`type Element` is the storage position. `type Metadata` is the typed
channel hierarchical aggregation rides on: a child `FedAvg`'s
`aggregate` emits `(params, FedAvgMeta { num_samples })`; a parent
layer's `contribute` receives the metadata and uses `num_samples` to
weight the child's contribution. Metadata moves through the slot table
as a typed Rust value; serde fires only when it crosses a wire
boundary. Impls with no metadata channel set `type Metadata = ();`.

The `ctx` parameter is the canonical hook for reaching a bound
`Backend` via `ctx.dependency::<B>("backend")` — that's how
`FedAvg<B>::aggregate` composes its weighted sum from the
backend's `Mul` + `Add` primitives without hardcoding a backend.
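
To make the metadata channel concrete, here is a toy sample-weighted average over plain `Vec<f32>` buffers. It is a sketch under stated assumptions: `ToyFedAvg` is hypothetical, it does the arithmetic directly instead of going through a bound `Backend`, it omits `ctx`, `PeerId`, and completion handles, and it assumes every contribution has the same length.

```rust
/// Metadata travelling alongside each tensor (field name taken from the text).
#[derive(Clone, Copy, Debug, Default, PartialEq)]
pub struct FedAvgMeta {
    pub num_samples: u64,
}

/// Toy aggregator; not the in-tree `FedAvg<B>`.
#[derive(Default)]
pub struct ToyFedAvg {
    contributions: Vec<(Vec<f32>, FedAvgMeta)>,
}

impl ToyFedAvg {
    /// Buffers one contribution together with its sample count.
    pub fn contribute(&mut self, tensor: &[f32], metadata: FedAvgMeta) {
        self.contributions.push((tensor.to_vec(), metadata));
    }

    /// Sample-weighted mean of the buffer; drains it on success.
    /// Returns `None` when the buffer is empty or carries zero samples.
    pub fn aggregate(&mut self) -> Option<(Vec<f32>, FedAvgMeta)> {
        let len = self.contributions.first()?.0.len();
        let total: u64 = self.contributions.iter().map(|(_, m)| m.num_samples).sum();
        if total == 0 {
            return None;
        }
        let mut out = vec![0.0f32; len];
        for (tensor, meta) in self.contributions.drain(..) {
            let weight = meta.num_samples as f32 / total as f32;
            for (o, t) in out.iter_mut().zip(&tensor) {
                *o += weight * t;
            }
        }
        // The emitted metadata lets a parent layer weight this result in turn.
        Some((out, FedAvgMeta { num_samples: total }))
    }
}
```

Because `aggregate` emits the summed `num_samples`, feeding a child's output into a parent's `contribute` gives the same result as averaging all leaf contributions in one flat pass.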

**Implementing it.** Pair the Contract impl with
`#[derive(bb::Aggregator)]`. The derive emits the
`AggregatorRuntime` bridge plus the `ConcreteComponent` and
`inventory::submit!` registration; the framework wires the bridge
into the engine's dispatch table.

**DSL surface.** The generic placeholder is
`bb_ops::placeholders::Aggregator`. Its methods record under
`ai.bytesandbrains.role.aggregator`:

- `contribute(g, contribution, metadata) -> Output`
- `aggregate(g, trigger) -> (Output, Output)` — emits
  `(params, metadata)`.

A library author shipping a concrete aggregator exposes equivalent
DSL methods as inherent methods on the type.

## Part 4 — `bb::Backend`

A tensor compute backend. The Contract has **two surfaces**, exposed
side-by-side, and a backend overrides whichever side is natural for
its target. Backend's user-facing methods are the **only** Contract
methods that do not take `ctx` — see *Backend ctx exemption* below.

1. **One typed method per mandatory primitive** — the 30 entries in
   [`bb_ir::tensor_primitives::TENSOR_PRIMITIVES_OPS`]:
   `add`, `sub`, `mul`, `div`, `neg`, `abs`, `sqrt`, `pow`, `exp`,
   `log`, `matmul`, `reduce_sum`, `reduce_mean`, `reduce_max`,
   `reduce_min`, `reshape`, `transpose`, `concat`, `slice`, `split`,
   `squeeze`, `unsqueeze`, `identity`, `cast`, `equal`, `greater`,
   `less`, `r#where`, `constant`, `gather`.
2. **One method to execute a subgraph**: `execute(&GraphProto, HashMap<String, Tensor>, BackendAttrs<'_>)
   -> Result<HashMap<String, Tensor>, Self::Error>`.

```rust
pub trait Backend: Send + Sync {
    type Error: std::error::Error
        + std::fmt::Display
        + Send + Sync
        + From<backend_default_walk::BackendWalkError>
        + 'static;

    type Tensor: Clone + Send + Sync + 'static + bb_ir::types::Storage;

    fn add(&self, a: &Self::Tensor, b: &Self::Tensor) -> Result<Self::Tensor, Self::Error> { … }
    fn matmul(&self, a: &Self::Tensor, b: &Self::Tensor) -> Result<Self::Tensor, Self::Error> { … }
    // … 28 more per-op methods, each with a default that wraps into `execute`.

    fn execute(
        &self,
        graph: &GraphProto,
        inputs: HashMap<String, Self::Tensor>,
        attrs: BackendAttrs<'_>,
    ) -> Result<HashMap<String, Self::Tensor>, Self::Error> {
        backend_default_walk::execute_graph_via_per_op(self, graph, inputs)
    }
}
```

Defaults in [`bb-runtime/src/contracts/backend_default_walk.rs`](../bb-runtime/src/contracts/backend_default_walk.rs)
bridge the two sides so the author overrides only one:

- A CPU backend overrides the 30 per-op methods (`add` runs ndarray's
  `Add`, `matmul` runs `dot`, …). `execute`'s default walker calls
  the per-op overrides.
- A whole-graph backend (Burn, ORT, Candle, tch) overrides `execute`
  natively (compile the `GraphProto` to native IR, run once). Per-op
  defaults wrap a one-node `GraphProto` and call `execute`.

A backend overriding **neither** side stack-overflows: every per-op
default delegates to `execute`, whose default walks back into per-op.
Backends MUST override at least one side.
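
The mutual-default arrangement can be shown with a toy trait. This is a minimal sketch, not the real Contract: `ToyBackend`, `Step`, `PerOp`, and `WholeGraph` are hypothetical, and the "graph" is a flat list of additions.

```rust
/// A one-instruction "graph" node for the sketch.
#[derive(Clone, Copy)]
pub enum Step {
    Add(f32, f32),
}

/// Toy two-sided trait: each default is written in terms of the other side.
pub trait ToyBackend {
    /// Per-op side. The default wraps the call into a one-step program.
    fn add(&self, a: f32, b: f32) -> f32 {
        self.execute(&[Step::Add(a, b)])[0]
    }

    /// Whole-program side. The default walks the steps through the per-op side.
    fn execute(&self, program: &[Step]) -> Vec<f32> {
        program
            .iter()
            .map(|step| match step {
                Step::Add(a, b) => self.add(*a, *b),
            })
            .collect()
    }
}

/// Overrides the per-op side; `execute` falls back to the default walker.
pub struct PerOp;
impl ToyBackend for PerOp {
    fn add(&self, a: f32, b: f32) -> f32 {
        a + b
    }
}

/// Overrides the whole-program side; `add` falls back to the one-step wrapper.
pub struct WholeGraph;
impl ToyBackend for WholeGraph {
    fn execute(&self, program: &[Step]) -> Vec<f32> {
        program
            .iter()
            .map(|step| match step {
                Step::Add(a, b) => a + b,
            })
            .collect()
    }
}
```

An `impl ToyBackend for Neither {}` with an empty body would compile, and calling either method on it would recurse until the stack overflows, which is the failure mode the rule above guards against.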

Activation functions, pooling, normalization, Conv, and other
extension ops are not on the Contract surface. A backend can declare
extension opsets and handle them in its `execute` override, or a
future lowering pass decomposes them into primitives.

**Backend ctx exemption.** Every other Contract method takes
`ctx: &mut RuntimeResourceRef<'_>` as its first positional
parameter; Backend's user-facing methods do not. The canonical
pattern that motivates `ctx` —
`let backend = ctx.dependency::<B>("backend")?; backend.mul(&a,
&b)?;` — borrows `&B` from `ctx`, and a `mul` signature that took
`&mut ctx` would re-borrow it while the previous borrow is still
live (E0502). Backend is the terminal dependency in the injection
chain (a leaf, not a composition seam), so the exemption costs
nothing: kernels stay pure tensor functions. The derive-emitted
`BackendRuntime::dispatch_atomic` bridge still receives `ctx` and
threads `current_node_attributes` + `current_node_metadata` into a
`BackendAttrs<'_>` for the `execute` override.
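
The borrow shape behind the exemption can be reproduced with two small stand-in types. This is a sketch under assumptions: `ToyCtx` and `ToyBackend` are hypothetical, and `dependency` is assumed to take `&self` and return a shared reference that borrows from the context.

```rust
/// Hypothetical leaf backend: kernels are pure functions of their inputs.
pub struct ToyBackend;

impl ToyBackend {
    pub fn mul(&self, a: f32, b: f32) -> f32 {
        a * b
    }
}

/// Hypothetical stand-in for the per-dispatch runtime surface.
pub struct ToyCtx {
    backend: ToyBackend,
    pub calls: u32,
}

impl ToyCtx {
    pub fn new() -> Self {
        ToyCtx { backend: ToyBackend, calls: 0 }
    }

    /// Hands out a shared borrow that keeps `self` borrowed while it lives.
    pub fn dependency(&self) -> &ToyBackend {
        &self.backend
    }
}

/// A non-Backend role method: takes `&mut ToyCtx`, then calls a ctx-free kernel.
pub fn scale(ctx: &mut ToyCtx, a: f32, b: f32) -> f32 {
    ctx.calls += 1;
    let backend = ctx.dependency(); // shared borrow of `*ctx` starts here
    // If `mul` took `&mut ToyCtx`, passing `ctx` on the next line would need a
    // mutable borrow while `backend` is still live, which rustc rejects (E0502).
    backend.mul(a, b)
}
```

Keeping the kernel free of `ctx` is what lets the last line compile without cloning the backend or restructuring the call.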

**Implementing it.** Pair the Contract impl with
`#[derive(bb::Backend)]`. The derive emits the `BackendRuntime`
bridge; `dispatch_atomic` routes each `ai.onnx::*` call through
`Backend::execute` on a single-node `GraphProto` so per-op overrides
on the Contract receive the dispatch automatically.

**DSL surface.** `bb_ops::placeholders::Backend` ships the full
`ai.onnx v1` DSL catalog (~48 methods including the 30 primitives,
common activations, normalization, conv/pool, gather/scatter, and
`If`/`Loop` subgraph carriers). Each method records an `ai.onnx::*`
`NodeProto`; the compiler's subgraph-collapsing pass fuses
contiguous `ai.onnx` runs into `BackendSubgraph` calls that the
bound backend's `execute` consumes.

### Backend-owned tensor memory

Tensors flowing through a `Backend`-bound slot are **thin
Arc-style handles** around backend-managed buffers (think
`std::shared_ptr`). The backend allocates, owns, and is free to
pool / reuse / free the underlying buffer; the framework holds
the handle. `Backend::Tensor: Clone + Send + Sync + 'static`
makes the handle cheap to copy across slots
(`Clone` becomes `Arc::clone` for pooling-friendly backends).

`CpuBackend` is the canonical in-tree implementer:

```rust
pub struct CpuTensor(pub(crate) Arc<CpuBackendBuffer>);

pub(crate) struct CpuBackendBuffer {
    pub(crate) data: ArrayD<f32>,
    pub(crate) dims_i64: Vec<i64>,
    pub(crate) charged_bytes: usize,
}
```

(`bb-ops/src/backends/cpu/tensor.rs:44-65`.) `Clone` is `Arc::clone`
(O(1) refcount bump); FedAvg's per-peer `tensor.clone()`
(`bb-ops/src/aggregators/fedavg/mod.rs`) costs one atomic
increment, not a `Vec<f32>` deep copy.
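
The handle semantics can be checked with `std::sync::Arc` alone. The sketch below uses hypothetical `Buffer` and `TensorHandle` types, not `CpuTensor`, and only demonstrates that cloning a handle shares the buffer instead of copying it.

```rust
use std::sync::Arc;

/// Hypothetical backend-owned buffer; only the handle around it is cloned.
pub struct Buffer {
    pub data: Vec<f32>,
}

/// Thin handle: cloning bumps the refcount and leaves `data` untouched.
#[derive(Clone)]
pub struct TensorHandle(pub Arc<Buffer>);

impl TensorHandle {
    pub fn new(data: Vec<f32>) -> Self {
        TensorHandle(Arc::new(Buffer { data }))
    }

    /// True when both handles point at the same allocation.
    pub fn shares_buffer_with(&self, other: &TensorHandle) -> bool {
        Arc::ptr_eq(&self.0, &other.0)
    }

    /// Number of live handles to this buffer.
    pub fn handle_count(&self) -> usize {
        Arc::strong_count(&self.0)
    }
}
```

After `let b = a.clone()`, `a.shares_buffer_with(&b)` holds and `a.handle_count()` is 2; dropping `b` brings the count back to 1 without freeing the buffer.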

### `materialize_from_wire` Contract method

```rust
fn materialize_from_wire(
    &self,
    type_hash: u64,
    bytes: Vec<u8>,
) -> Result<Self::Tensor, Self::Error>;
```

(`bb-runtime/src/contracts/backend.rs:497-522`.) The framework
calls this when a tensor `SlotFill` arrives at a slot whose
binding is a `Backend` role. Lifecycle:

1. Wire-decode enforces the `EnvelopeCaps::max_per_fill_bytes` cap
   and charges `Engine::ingress_byte_budget` via `try_charge` before
   the backend sees anything (Principle 1).
2. The framework `mem::take`s `fill.payload` (already
   framework-owned from envelope decode) and hands the
   `Vec<u8>` to `materialize_from_wire` **by value**: an
   ownership transfer, not a borrow.
3. The backend may adopt the bytes zero-copy
   (`ArrayD::from_shape_vec` when alignment permits), pull a
   buffer from a pool and copy in, or fresh-allocate. The
   framework will not touch `bytes` after the call returns.
4. On `Ok(tensor)` the engine wraps the result in
   `BackendTensorCarrier`
   (`bb-runtime/src/slot_value.rs:43-174`) and stamps the
   accounting fields (`charged_bytes`, `backend_ref`).
5. On `Err` the engine releases the byte charge, drops the
   fill, and emits
   `InfraEvent::WireReceiveError::BackendMaterializeFailed`.
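
The charge, take, materialize, release-on-error sequence above can be sketched as follows. All names are hypothetical (`ByteBudget`, `ingest`, `materialize_f32`), the payload is assumed to be a packed little-endian `f32` array, and the backend side takes the copy path from step 3, not the zero-copy one.

```rust
/// Hypothetical ingress budget with charge / release accounting.
pub struct ByteBudget {
    limit: usize,
    used: usize,
}

impl ByteBudget {
    pub fn new(limit: usize) -> Self {
        ByteBudget { limit, used: 0 }
    }

    /// Charges `n` bytes if they fit under the limit.
    pub fn try_charge(&mut self, n: usize) -> bool {
        if n > self.limit - self.used {
            return false;
        }
        self.used += n;
        true
    }

    pub fn release(&mut self, n: usize) {
        self.used = self.used.saturating_sub(n);
    }

    pub fn used(&self) -> usize {
        self.used
    }
}

/// Backend side: receives the bytes by value and owns them from here on.
/// Assumption: the wire format is packed little-endian f32.
pub fn materialize_f32(bytes: Vec<u8>) -> Result<Vec<f32>, String> {
    if bytes.len() % 4 != 0 {
        return Err(format!("payload length {} is not a multiple of 4", bytes.len()));
    }
    Ok(bytes
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect())
}

/// Framework side: charge first, move the payload out, release on failure.
pub fn ingest(budget: &mut ByteBudget, payload: &mut Vec<u8>) -> Result<Vec<f32>, String> {
    let charged = payload.len();
    if !budget.try_charge(charged) {
        return Err("ingress byte budget exceeded".to_string());
    }
    let bytes = std::mem::take(payload); // `payload` is left empty
    match materialize_f32(bytes) {
        Ok(tensor) => Ok(tensor),
        Err(e) => {
            budget.release(charged);
            Err(e)
        }
    }
}
```

On success the charge stays in place for the lifetime of the tensor; on failure the budget returns to its previous level and the fill is gone.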

**Ownership rationale.** The payload is passed as `Vec<u8>` by value
(not `&[u8]` or `Cow`) because this is the framework-to-backend handoff, NOT an
external boundary — Principle 1a (ephemeral borrowed slices)
applies to transport ingress, not to framework-internal handoffs.
The backend lives inside the framework ecosystem and plays by
the runtime contract.

**Default impl.** The trait provides a default that delegates to
the global `wire_decoder_registry()`: look up the decoder for
`type_hash`, run it on the bytes, downcast the resulting
`Box<dyn SlotValue>` to `Self::Tensor`. Backends without tensor
pooling work through this default; backends that override pay
the registry hop only at override time. The derive bridge in
`#[derive(bb::Backend)]` (`bb-derive/src/roles.rs:368-389`)
generates the `BackendRuntime::materialize_from_wire` forwarding
shim automatically.

**`bb::Backend` is the only role with a dedicated wire-materialise
hook.** Other roles (`Aggregator`, `Index`, `Model`, …) receive
their tensors through `RuntimeResourceRef::dependency::<B>()`
already materialised by the bound backend; they never see a
`Vec<u8>` from the wire directly.

## Part 5 — `bb::Codec`

A typed in/out storage bridge — quantizers (affine int8, PQ), dtype
lifts (f32 ↔ f16), opaque-bytes compressors (zstd). `Codec` is the
only Contract with two `Storage`-bound associated types because it
bridges two positions in the tensor-type tree. Authors wire it
explicitly when an upstream output type doesn't unify with a
downstream port type; the compiler reports the mismatch and the
author chooses the appropriate `Codec` impl.

```rust
pub trait Codec: Send + Sync {
    /// Input storage position.
    type In: ?Sized + bb_ir::types::Storage;

    /// Output storage position. Different position from `In`
    /// (an identity bridge carries no value — remove it instead).
    type Out: ?Sized + bb_ir::types::Storage;

    type Error: std::error::Error + std::fmt::Display + Send + Sync + 'static;

    /// Optional training pass (calibration for quantizers, k-means
    /// for PQ codebooks). Plain dtype casts skip this.
    /// Default returns `Now(Ok(()))`.
    fn train(
        &mut self,
        ctx: &mut RuntimeResourceRef<'_>,
        samples: &[&Self::In],
        completion: CompletionHandle<(), Self::Error>,
    ) -> ContractResponse<(), Self::Error> { ContractResponse::Now(Ok(())) }

    /// `In → Out`.
    fn encode(
        &self,
        ctx: &mut RuntimeResourceRef<'_>,
        input: &Self::In,
        completion: CompletionHandle<Box<Self::Out>, Self::Error>,
    ) -> ContractResponse<Box<Self::Out>, Self::Error>;

    /// `Out → In`. Lossy codecs implement the best-effort inverse.
    fn decode(
        &self,
        ctx: &mut RuntimeResourceRef<'_>,
        encoded: &Self::Out,
        completion: CompletionHandle<Box<Self::In>, Self::Error>,
    ) -> ContractResponse<Box<Self::In>, Self::Error>;
}
```

Example — f32 → u8 affine quantizer:

```rust
#[derive(bb::Concrete, bb::Codec)]
struct Int8AffineQuantizer { scale: f32, zero_point: i32 }

impl Codec for Int8AffineQuantizer {
    type In  = [f32];          // TYPE_TENSOR_F32
    type Out = [u8];           // TYPE_TENSOR_U8
    type Error = QuantizeError;
    fn train(&mut self, ctx, samples, …)  { /* compute scale + zero_point */ }
    fn encode(&self, ctx, x, …)           { /* affine quantize */ }
    fn decode(&self, ctx, y, …)           { /* affine dequantize */ }
}
```

A codec that materializes calibration tensors on-device reaches the
bound Backend through `ctx.dependency::<MyBackend>("backend")` —
the same dep-injection chain every non-Backend role uses.

`train(samples)` runs once per codec instance to fit the bridge.
Affine int8 quantizers compute `(scale, zero_point)` from the
sample slice; PQ codecs run k-means per sub-vector to build the
codebooks; plain dtype casts (`f32 → f16`, `bf16 → f32`) skip the
call. The same bootstrap-vs-barrier ordering options apply as
with `Index::train`: record the call inside `Module::bootstrap`
to gate body-phase `encode` / `decode` ops on training
completion (the `is_op_locked` gate parks every body op touching
the bound Codec — see [Part 11](#part-11--bbbootstrap)), or wire
the trigger through a `bb.barrier`.
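
For the affine int8 case, the arithmetic behind `train` / `encode` / `decode` might look like the sketch below. It is one common min/max calibration scheme chosen for illustration, not necessarily what an in-tree quantizer does; `AffineParams`, `fit`, `encode`, and `decode` are hypothetical free functions without the Contract's `ctx` and completion parameters.

```rust
/// Calibration result: real value ~= (q - zero_point) * scale.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct AffineParams {
    pub scale: f32,
    pub zero_point: i32,
}

/// Fits parameters so the observed range (widened to include 0.0) maps
/// onto 0..=255. Returns `None` when no finite sample is available.
pub fn fit(samples: &[&[f32]]) -> Option<AffineParams> {
    let mut min = f32::INFINITY;
    let mut max = f32::NEG_INFINITY;
    for &x in samples.iter().flat_map(|s| s.iter()) {
        min = min.min(x);
        max = max.max(x);
    }
    if !min.is_finite() || !max.is_finite() {
        return None;
    }
    // Including zero keeps 0.0 exactly representable after quantization.
    let (min, max) = (min.min(0.0), max.max(0.0));
    let scale = if max > min { (max - min) / 255.0 } else { 1.0 };
    let zero_point = (-min / scale).round() as i32;
    Some(AffineParams { scale, zero_point })
}

/// `In -> Out`: quantize, clamping to the u8 range.
pub fn encode(p: AffineParams, x: &[f32]) -> Vec<u8> {
    x.iter()
        .map(|&v| ((v / p.scale).round() as i32 + p.zero_point).clamp(0, 255) as u8)
        .collect()
}

/// `Out -> In`: best-effort inverse; the rounding error is not recoverable.
pub fn decode(p: AffineParams, q: &[u8]) -> Vec<f32> {
    q.iter()
        .map(|&b| (b as i32 - p.zero_point) as f32 * p.scale)
        .collect()
}
```

A round trip through `encode` and `decode` reproduces in-range inputs to within one quantization step (`scale`), which is the sense in which the inverse is lossy.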

**Implementing it.** Pair with `#[derive(bb::Codec)]`.

**DSL surface.** `bb_ops::placeholders::CodecSlot` records under
`ai.bytesandbrains.role.codec`:

- `encode(g, input) -> Output`
- `decode(g, encoded) -> Output`
- `train(g, samples) -> Output` (`TYPE_TRIGGER`)

## Part 6 — `bb::DataSource`

A data source / data loader. Produces batches into the Module.

```rust
pub trait DataSource: Send + Sync {
    /// Sample storage type. Covers both the batch tensor and the
    /// optional labels tensor. Implement as `[f32]` for flat f32
    /// sample batches.
    type Sample: ?Sized + bb_ir::types::Storage;

    type Error: std::error::Error + std::fmt::Display + Send + Sync + 'static;

    fn next_batch(
        &mut self,
        ctx: &mut RuntimeResourceRef<'_>,
        completion: CompletionHandle<(Box<Self::Sample>, Box<Self::Sample>), Self::Error>,
    ) -> ContractResponse<(Box<Self::Sample>, Box<Self::Sample>), Self::Error>;

    fn reset(
        &mut self,
        ctx: &mut RuntimeResourceRef<'_>,
        completion: CompletionHandle<(), Self::Error>,
    ) -> ContractResponse<(), Self::Error>;

    fn on_data_loaded(
        &mut self,
        ctx: &mut RuntimeResourceRef<'_>,
        completion: CompletionHandle<(), Self::Error>,
    ) -> ContractResponse<(), Self::Error>;
}
```

`next_batch` returns `(batch, labels)` as boxed `Self::Sample` slices;
the second slot is zero-length for unsupervised sources.
`on_data_loaded` is a one-shot notification a source fires once its
data is ready to read (e.g. dataset download complete). A source
that lands its batch tensors on a device-resident backend reaches
the bound concrete via `ctx.dependency::<MyBackend>("backend")`.
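
The `(batch, labels)` shape for an unsupervised source can be sketched with an in-memory buffer. `InMemorySource` is hypothetical, omits `ctx` and completion handles, and signals exhaustion with `None`, which is a choice made for this example only.

```rust
/// Hypothetical unsupervised source over a flat f32 buffer.
pub struct InMemorySource {
    samples: Vec<f32>,
    batch_len: usize,
    cursor: usize,
}

impl InMemorySource {
    pub fn new(samples: Vec<f32>, batch_len: usize) -> Self {
        InMemorySource { samples, batch_len, cursor: 0 }
    }

    /// Returns `(batch, labels)`; `labels` is zero-length because the
    /// source is unsupervised. Returns `None` once the data is exhausted.
    pub fn next_batch(&mut self) -> Option<(Box<[f32]>, Box<[f32]>)> {
        if self.batch_len == 0 || self.cursor >= self.samples.len() {
            return None;
        }
        let end = (self.cursor + self.batch_len).min(self.samples.len());
        let batch: Box<[f32]> = self.samples[self.cursor..end].into();
        self.cursor = end;
        Some((batch, Vec::new().into_boxed_slice()))
    }

    /// Rewinds to the first batch.
    pub fn reset(&mut self) {
        self.cursor = 0;
    }
}
```

With three samples and a batch length of two, the source yields a full batch, then a short one, then `None` until `reset` is called.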

**Implementing it.** Pair with `#[derive(bb::DataSource)]`.

**DSL surface.** `bb_ops::placeholders::DataLoader` records under
`ai.bytesandbrains.role.data_source`:

- `next_batch(g) -> (Output, Output)`, emitting `(batch, labels)`.
- `reset(g, trigger) -> Output`
- `on_data_loaded(g) -> Output`

## Part 7 — `bb::Index`

A vector index. Wraps an in-process structure (FAISS, ScaNN), a
database (SQLite + extensions, pgvector), or a custom impl.

```rust
pub trait Index: Send + Sync {
    /// Vector storage. Pick the position in the type tree:
    /// `[f32]` for an f32-native index, `AnyTensor` for an
    /// algorithm-class index that outsources distance math to
    /// a bound `Backend`, a custom type for specialized dtypes.
    type Vector: ?Sized + bb_ir::types::Storage;

    type Error: std::error::Error + std::fmt::Display + Send + Sync + 'static;

    fn add(
        &mut self,
        ctx: &mut RuntimeResourceRef<'_>,
        vec: &Self::Vector,
        completion: CompletionHandle<u64, Self::Error>,
    ) -> ContractResponse<u64, Self::Error>;

    fn search(
        &self,
        ctx: &mut RuntimeResourceRef<'_>,
        query: &Self::Vector,
        k: u32,
        completion: CompletionHandle<Vec<(u64, f32)>, Self::Error>,
    ) -> ContractResponse<Vec<(u64, f32)>, Self::Error>;

    fn remove(
        &mut self,
        ctx: &mut RuntimeResourceRef<'_>,
        id: u64,
        completion: CompletionHandle<(), Self::Error>,
    ) -> ContractResponse<(), Self::Error>;

    /// Optional training pass. IVF needs centroid k-means; Product
    /// Quantization (PQ) needs sub-vector codebook learning; flat
    /// and hand-tuned indexes skip it. Default returns
    /// `Now(Ok(()))` so impls that do not train pay zero cost.
    fn train(
        &mut self,
        ctx: &mut RuntimeResourceRef<'_>,
        samples: &[&Self::Vector],
        completion: CompletionHandle<(), Self::Error>,
    ) -> ContractResponse<(), Self::Error> { ContractResponse::Now(Ok(())) }
}
```

An algorithm-class index (e.g. an HNSW shell that delegates
distance math) declares `#[depends(backend = "<slot>")]` and
reaches the bound backend through `ctx.dependency::<B>("<slot>")`
inside `search`. See `examples/component_with_dependency.rs` for
the worked pattern.

`train(samples)` runs once per index instance ahead of `add`
traffic. IVF impls compute centroids over the sample slice and
keep them as the coarse quantizer; PQ impls learn one codebook
per sub-vector and keep them as the encoder. Authors gate body
`add` / `search` ops on training completion either by recording
the call inside `Module::bootstrap` (the per-component
`is_op_locked` gate parks `add` / `search` ops touching the
bound Index until the bootstrap drains — see
[Part 11](#part-11--bbbootstrap)) or by wiring the returned
trigger into a `bb.barrier`.
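
A flat (no-training) index is the simplest instance of the `add` / `search` / `remove` surface. The sketch below is hypothetical: `FlatIndex` drops `ctx` and completion handles, and it assumes the `f32` in each result pair is a squared L2 distance sorted ascending, which the Contract itself does not specify.

```rust
/// Hypothetical brute-force index; `train` would be a no-op for it.
pub struct FlatIndex {
    next_id: u64,
    rows: Vec<(u64, Vec<f32>)>,
}

impl FlatIndex {
    pub fn new() -> Self {
        FlatIndex { next_id: 0, rows: Vec::new() }
    }

    /// Stores the vector and returns its freshly assigned id.
    pub fn add(&mut self, vec: &[f32]) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.rows.push((id, vec.to_vec()));
        id
    }

    /// Returns up to `k` `(id, squared_l2)` pairs, nearest first.
    pub fn search(&self, query: &[f32], k: u32) -> Vec<(u64, f32)> {
        let mut hits: Vec<(u64, f32)> = self
            .rows
            .iter()
            .map(|(id, row)| {
                let d2: f32 = row.iter().zip(query).map(|(a, b)| (a - b) * (a - b)).sum();
                (*id, d2)
            })
            .collect();
        hits.sort_by(|a, b| a.1.total_cmp(&b.1));
        hits.truncate(k as usize);
        hits
    }

    /// Removes the entry; returns whether the id was present.
    pub fn remove(&mut self, id: u64) -> bool {
        let before = self.rows.len();
        self.rows.retain(|(row_id, _)| *row_id != id);
        self.rows.len() != before
    }
}
```

An IVF or PQ index would keep the same three entry points and put its centroid or codebook fitting behind `train`, leaving `search` to consult the trained structure before falling back to exact distances.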

**Implementing it.** Pair with `#[derive(bb::Index)]`.

**DSL surface.** `bb_ops::placeholders::IndexSlot` records under
`ai.bytesandbrains.role.index`:

- `add(g, vec) -> Output`
- `search(g, query, k) -> Output`
- `remove(g, id) -> Output`
- `train(g, samples) -> Output` (`TYPE_TRIGGER`)

## Part 8 — `bb::Model`

An ML model. Forward / backward / optimizer step / parameter
snapshot.

```rust
pub trait Model: Send + Sync {
    /// Tensor storage type. One associated type covers
    /// input, output, params, grad, and delta.
    /// Implement as `[f32]` for flat f32 tensors.
    /// Mixed-precision models wire `Codec` nodes around the model
    /// rather than multiplying associated types per port.
    type Tensor: ?Sized + bb_ir::types::Storage;

    type Error: std::error::Error + std::fmt::Display + Send + Sync + 'static;

    fn forward(
        &mut self,
        ctx: &mut RuntimeResourceRef<'_>,
        input: &Self::Tensor,
        completion: CompletionHandle<Box<Self::Tensor>, Self::Error>,
    ) -> ContractResponse<Box<Self::Tensor>, Self::Error>;

    fn load_parameters(
        &mut self,
        ctx: &mut RuntimeResourceRef<'_>,
        params: &Self::Tensor,
        completion: CompletionHandle<(), Self::Error>,
    ) -> ContractResponse<(), Self::Error>;

    fn backward(
        &mut self,
        ctx: &mut RuntimeResourceRef<'_>,
        grad: &Self::Tensor,
        completion: CompletionHandle<(), Self::Error>,
    ) -> ContractResponse<(), Self::Error>;

    fn apply_delta(
        &mut self,
        ctx: &mut RuntimeResourceRef<'_>,
        delta: &Self::Tensor,
        completion: CompletionHandle<(), Self::Error>,
    ) -> ContractResponse<(), Self::Error>;

    /// Loss is always a framework-fixed `f32` scalar regardless of
    /// the tensor element type.
    fn compute_loss(
        &mut self,
        ctx: &mut RuntimeResourceRef<'_>,
        input: &Self::Tensor,
        target: &Self::Tensor,
        completion: CompletionHandle<f32, Self::Error>,
    ) -> ContractResponse<f32, Self::Error>;

    fn params(
        &self,
        ctx: &mut RuntimeResourceRef<'_>,
        completion: CompletionHandle<Box<Self::Tensor>, Self::Error>,
    ) -> ContractResponse<Box<Self::Tensor>, Self::Error>;
}
```

`params` returns an owned `Box<Self::Tensor>` snapshot — async
serialization needs owned values. A model whose forward pass runs
on a bound `Backend` reaches the backend through
`ctx.dependency::<B>("<slot>")` and composes the per-op surface
inside `forward` / `backward`.

**Implementing it.** Pair with `#[derive(bb::Model)]`.

**DSL surface.** `bb_ops::placeholders::Model` records under
`ai.bytesandbrains.role.model`:

- `forward(g, input) -> Output`
- `load_parameters(g, params) -> Output`
- `backward(g, grad) -> Output`
- `apply_delta(g, delta) -> Output`
- `compute_loss(g, input, target) -> Output`
- `params(g) -> Output`

## Part 9 — `bb::PeerSelector`

A peer-selection protocol. The framework's gossip overlay provides
one impl; users needing a custom view (constant, weighted,
geographic) write their own.

```rust
pub trait PeerSelector: Send + Sync {
    type Error: std::error::Error + std::fmt::Display + Send + Sync + 'static;

    fn select(
        &mut self,
        ctx: &mut RuntimeResourceRef<'_>,
        params: SelectParams,
        completion: CompletionHandle<Vec<PeerId>, Self::Error>,
    ) -> ContractResponse<Vec<PeerId>, Self::Error>;

    fn sample(
        &mut self,
        ctx: &mut RuntimeResourceRef<'_>,
        n: u32,
        completion: CompletionHandle<Vec<PeerId>, Self::Error>,
    ) -> ContractResponse<Vec<PeerId>, Self::Error> {
        self.select(ctx, SelectParams::Random { n }, completion)
    }

    fn current_view(
        &mut self,
        ctx: &mut RuntimeResourceRef<'_>,
        completion: CompletionHandle<Vec<PeerId>, Self::Error>,
    ) -> ContractResponse<Vec<PeerId>, Self::Error>;
}

pub enum SelectParams {
    Random { n: u32 },
    NearKey { key: Vec<u8>, n: u32 },
    All,
}
```

Selector impls read `ctx.peers.addresses` to walk the local
`AddressBook`, write through it for membership updates from the
`dispatch_atomic` arm (`Announce` / `Forget`), and reach the
scheduler via `ctx.time` when planning a delayed probe. Declared
dependencies are reached via `ctx.dependency::<T>("<slot>")` —
the same surface every other non-Backend Contract uses.

`SelectParams` is an open enum — new variants are additive. Concrete
impls handle the variants they support and surface
`ContractResponse::Now(Err(_))` for unsupported variants (e.g. a DHT
view handles `NearKey`; a fixed-list view handles only `All`).
`sample` defaults to `select(SelectParams::Random { n })`; impls may
override it for an optimized fast path.
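
The variant-handling convention can be sketched with a fixed-list view. This is a minimal illustration, not the framework's code: `PeerId`, `ContractResponse`, and `SelectParams` below are simplified local stand-ins (the real `select` also takes `ctx` and `completion`), and `FixedList` / `FixedListError` are hypothetical names.

```rust
// Local stand-ins so the sketch compiles on its own. The real types live in
// the framework; `ContractResponse` here models only the synchronous arm.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PeerId(pub u64);

#[derive(Debug, PartialEq)]
pub enum ContractResponse<T, E> {
    Now(Result<T, E>),
}

pub enum SelectParams {
    Random { n: u32 },
    NearKey { key: Vec<u8>, n: u32 },
    All,
}

/// Hypothetical error type for the fixed-list view.
#[derive(Debug, PartialEq)]
pub enum FixedListError {
    Unsupported(&'static str),
}

impl std::fmt::Display for FixedListError {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        match self {
            FixedListError::Unsupported(v) => {
                write!(f, "unsupported SelectParams variant: {}", v)
            }
        }
    }
}

impl std::error::Error for FixedListError {}

/// Hypothetical selector backed by a constant peer list.
pub struct FixedList {
    peers: Vec<PeerId>,
}

impl FixedList {
    pub fn new(peers: Vec<PeerId>) -> Self {
        Self { peers }
    }

    /// Handles `All`; every other variant is surfaced as `Now(Err(_))`.
    pub fn select(
        &mut self,
        params: SelectParams,
    ) -> ContractResponse<Vec<PeerId>, FixedListError> {
        match params {
            SelectParams::All => ContractResponse::Now(Ok(self.peers.clone())),
            SelectParams::Random { .. } => {
                ContractResponse::Now(Err(FixedListError::Unsupported("Random")))
            }
            SelectParams::NearKey { .. } => {
                ContractResponse::Now(Err(FixedListError::Unsupported("NearKey")))
            }
        }
    }
}
```

Because the default `sample` forwards to `select` with `SelectParams::Random`, a view like this one would report `sample` as unsupported unless it overrides the method.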

**Implementing it.** Pair with `#[derive(bb::PeerSelector)]`.

**DSL surface.** `bb_ops::placeholders::PeerSelector` records under
`ai.bytesandbrains.role.peer_selector`:

- `sample(g, n) -> Output<PeerId>`
- `current_view(g) -> Output<PeerId>`

The placeholder carries a `class: &'static str` that tags every
emitted `Output<PeerId>` with the peer class it samples from. The
compiler's class-inference pass reads the tag so downstream
`wire.send`s flow to the right destination class — that's how a
gossip self-send partitions correctly. Construct as
`PeerSelector::of_class("gossip_peer")` to retarget; `Default`
selects `bb_ir::peer_class::SELF_CLASS`.
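
A toy model of the class tag may help. Everything here is a hypothetical stand-in: `PeerSelectorPlaceholder`, `PeerOutput`, and `destination_class` are illustrative names, the graph argument `g` is omitted, and the string assigned to `SELF_CLASS` is a placeholder because the real constant's value is not shown in this section.

```rust
/// Placeholder value (assumption): stands in for
/// `bb_ir::peer_class::SELF_CLASS`, whose real value is not shown here.
pub const SELF_CLASS: &str = "self";

/// Stand-in for `Output<PeerId>`: only the class tag is modeled.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PeerOutput {
    pub class: &'static str,
}

/// Stand-in for the DSL placeholder; it only remembers its class.
pub struct PeerSelectorPlaceholder {
    class: &'static str,
}

impl PeerSelectorPlaceholder {
    pub fn of_class(class: &'static str) -> Self {
        Self { class }
    }

    /// Every emitted output carries the placeholder's class tag.
    pub fn sample(&self, _n: u32) -> PeerOutput {
        PeerOutput { class: self.class }
    }

    pub fn current_view(&self) -> PeerOutput {
        PeerOutput { class: self.class }
    }
}

impl Default for PeerSelectorPlaceholder {
    fn default() -> Self {
        Self::of_class(SELF_CLASS)
    }
}

/// Toy class inference: a send targets the class of the peers it is fed.
pub fn destination_class(peers: PeerOutput) -> &'static str {
    peers.class
}
```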

## Part 10 — The Protocol Slot

The `Protocol` role hosts bring-your-own-protocol implementations
(Kademlia, Chord, custom overlays). Unlike the other seven roles,
protocols do not share a fixed verb catalog — each protocol declares
its own atomic opset (`<crate>.<ProtocolName>.atomic v<n>`) and the
op-types in that opset are protocol-specific.

For this reason there is **no `bb::Protocol` Contract trait** and no
`#[derive(bb::Protocol)]`. The user-facing authoring path is the
declarative macro
[`bb::register_protocol!{}`](../bb-derive/src/lib.rs), which writes the
protocol struct's serde impls, `ConcreteComponent` impl,
`AnyComponent` impl, framework-internal `ProtocolRuntime` impl,
`atomic_opset` declaration, `dispatch_atomic` body, and inventory
submission in one block:

```rust
bb::register_protocol! {
    struct Kademlia { routing_table: Vec<u64>, k: usize }
    domain: "bb-kademlia.kademlia.atomic"
    version: 1
    ops {
        FindNode,
        Ping,
    }
}
```

The `ProtocolRuntime` trait the macro generates carries the same
pair every other role has: `atomic_opset()` declaring the
protocol's op set, and `dispatch_atomic(op_type, inputs, ctx)`
routing op types to Rust bodies. For inbound envelopes the framework
synthesizes the dispatch inputs from the wire envelope (peer id,
raw payload bytes, correlation handle); for user-graph DSL ops the
inputs come from upstream slot values exactly like any role op.
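
The `atomic_opset` / `dispatch_atomic` pair can be pictured with a hand-written analogue. This is a sketch under stated assumptions, not the macro's expansion: the trait signature is simplified (no `ctx`), inputs and outputs are modeled as raw byte vectors, and `OpSet`, `DispatchError`, and the op bodies are hypothetical.

```rust
/// Hypothetical opset descriptor: domain, version, and op-type names.
pub struct OpSet {
    pub domain: &'static str,
    pub version: u32,
    pub ops: &'static [&'static str],
}

#[derive(Debug, PartialEq)]
pub enum DispatchError {
    UnknownOp(String),
    BadInput(&'static str),
}

/// Simplified stand-in for the framework-internal trait.
pub trait ProtocolRuntime {
    fn atomic_opset(&self) -> OpSet;
    fn dispatch_atomic(
        &mut self,
        op_type: &str,
        inputs: &[Vec<u8>],
    ) -> Result<Vec<Vec<u8>>, DispatchError>;
}

pub struct Kademlia {
    pub routing_table: Vec<u64>,
    pub k: usize,
}

impl ProtocolRuntime for Kademlia {
    fn atomic_opset(&self) -> OpSet {
        OpSet {
            domain: "bb-kademlia.kademlia.atomic",
            version: 1,
            ops: &["FindNode", "Ping"],
        }
    }

    fn dispatch_atomic(
        &mut self,
        op_type: &str,
        inputs: &[Vec<u8>],
    ) -> Result<Vec<Vec<u8>>, DispatchError> {
        match op_type {
            "Ping" => Ok(vec![b"pong".to_vec()]),
            "FindNode" => {
                // Assumed encoding: first input is an 8-byte big-endian id.
                let raw = inputs
                    .first()
                    .ok_or(DispatchError::BadInput("missing target"))?;
                if raw.len() != 8 {
                    return Err(DispatchError::BadInput("target must be 8 bytes"));
                }
                let mut bytes = [0u8; 8];
                bytes.copy_from_slice(raw);
                let target = u64::from_be_bytes(bytes);
                // Keep the k known ids closest to the target by XOR distance.
                let mut ids = self.routing_table.clone();
                ids.sort_by_key(|id| id ^ target);
                ids.truncate(self.k);
                Ok(ids.into_iter().map(|id| id.to_be_bytes().to_vec()).collect())
            }
            other => Err(DispatchError::UnknownOp(other.to_string())),
        }
    }
}
```

Whether an op arrives from an inbound envelope or from a user-graph DSL node, it reaches the same `match`; only the origin of `inputs` differs.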

The DSL placeholder for the slot is
`bb_ops::placeholders::Protocol`. It carries no DSL methods of its
own — protocols surface their per-op DSL methods on the concrete
struct, emitted by `register_protocol!{}` alongside the runtime
bridge. The placeholder exists solely so Modules can declare a
generic Protocol slot the compiler chain binds at compile time.

See [WIRE.md](WIRE.md) for the wire envelope shape and a worked
Gossip-protocol example.

## Part 11 — `bb::Bootstrap`

The optional Component initialization phase. Every Component
(every `#[derive(bb::Concrete)]` type) participates implicitly —
the derive emits a default no-op `impl Bootstrap`
(`bb-derive/src/roles.rs:46-79`,
`bb-runtime/src/contracts/bootstrap.rs:54-67`) so most concretes
need zero boilerplate. Authors **override** when a Component
needs to allocate resources, mmap state, prime a calibration
buffer, or otherwise stage work before any of its other Contract
methods runs.

```rust
pub trait Bootstrap {
    type Error: std::error::Error + Send + Sync + 'static;

    fn bootstrap(&mut self, _ctx: &mut BootstrapCtx)
        -> Result<(), Self::Error>
    {
        Ok(())
    }
}
```

The host fires Component bootstraps explicitly via
`Node::run_bootstrap(BootstrapTarget::Slots(&[slot, ...]))`
(`bb-runtime/src/node/mod.rs`). The engine resolves
`slot → ComponentRef`, allocates a fresh `ExecId`, locks the
`{cref}` touch set on `bootstrap.in_flight`, and invokes the
override through the per-T dispatcher registry the derive
registered. Disjoint Component bootstraps fire concurrently —
the `is_op_locked` gate parks only the touched components, so
body ops on disjoint slots keep firing during a Component
bootstrap.

`DispatchResult::Immediate(_)` retires the in-flight entry
synchronously. `DispatchResult::Async(cmd_id)` parks the body
`ExecId` on `pending_async`; the impl's later
`ctx.complete_command(cmd_id, ...)` drives the drain through
the regular `handle_completion` path.
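
The two retirement paths can be modeled with a small bookkeeping sketch. It is an illustration only: `BootstrapTracker` is a hypothetical type, `ExecId` and `CmdId` are modeled as plain integers, the `DispatchResult` payload is simplified, and the engine's real `in_flight` and `pending_async` structures may look quite different.

```rust
use std::collections::{HashMap, HashSet};

// Assumption: ids are modeled as plain integers for this sketch.
pub type ExecId = u64;
pub type CmdId = u64;

/// Simplified stand-in for the dispatcher's result type.
pub enum DispatchResult {
    Immediate(Result<(), String>),
    Async(CmdId),
}

/// Hypothetical tracker mirroring the in-flight / pending-async bookkeeping.
#[derive(Default)]
pub struct BootstrapTracker {
    in_flight: HashSet<ExecId>,
    pending_async: HashMap<CmdId, ExecId>,
}

impl BootstrapTracker {
    /// Record that a bootstrap execution has started.
    pub fn start(&mut self, exec: ExecId) {
        self.in_flight.insert(exec);
    }

    /// Immediate results retire the entry now; async results park it.
    pub fn on_dispatch(&mut self, exec: ExecId, result: DispatchResult) {
        match result {
            DispatchResult::Immediate(_) => {
                self.in_flight.remove(&exec);
            }
            DispatchResult::Async(cmd) => {
                self.pending_async.insert(cmd, exec);
            }
        }
    }

    /// Retire a parked execution; returns false for an unknown command id.
    pub fn complete_command(&mut self, cmd: CmdId) -> bool {
        match self.pending_async.remove(&cmd) {
            Some(exec) => {
                self.in_flight.remove(&exec);
                true
            }
            None => false,
        }
    }

    pub fn is_in_flight(&self, exec: ExecId) -> bool {
        self.in_flight.contains(&exec)
    }
}
```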

### When a Concrete should override `Bootstrap`

Override when:

- **Backend pools.** The Backend allocates pinned host buffers,
  GPU streams, or a kernel cache before body ops issue tensor
  work.
- **Index mmap / file-backed state.** The Index opens its
  on-disk store, validates the header, and primes any in-memory
  caches before `add` / `search` ops fire.
- **Codec calibration.** A quantization codec pulls a calibration
  sample from its bound `DataSource` and computes
  `(scale, zero_point)` before `encode` / `decode` ops fire.
- **Protocol Kademlia bootstrap.** A protocol Component contacts
  its seed peers to populate the routing table before the body
  phase emits `FindNode` traffic.
- **Async one-shot setup.** Any setup that returns
  `ContractResponse::Later` so the engine can park the body
  phase while the work completes off-thread.

Skip the override when the Component is purely reactive — a
stateless Aggregator, a `Backend` whose tensor pool lazy-allocates
on first kernel call, a `DataSource` that loads from an in-memory
buffer constructed at install. The default no-op runs through
the dispatcher just like any other Contract method; the
`is_op_locked` gate clears immediately so body ops fire as soon
as the host kicks the queue.
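
The contrast between overriding and skipping can be sketched with the codec-calibration case. The trait below mirrors the definition shown earlier, but everything else is an assumption for illustration: `BootstrapCtx` is a local stand-in whose `calibration_sample` field is hypothetical (the text above has the codec pull the sample from its bound `DataSource`), `QuantCodec`, `CalibrationError`, and `MeanAggregator` are made-up names, and the affine 8-bit formula is one common choice rather than the framework's.

```rust
/// Local stand-in; the `calibration_sample` field is hypothetical.
pub struct BootstrapCtx {
    pub calibration_sample: Vec<f32>,
}

pub trait Bootstrap {
    type Error: std::error::Error + Send + Sync + 'static;

    fn bootstrap(&mut self, _ctx: &mut BootstrapCtx) -> Result<(), Self::Error> {
        Ok(())
    }
}

#[derive(Debug, PartialEq)]
pub enum CalibrationError {
    EmptySample,
    ZeroRange,
}

impl std::fmt::Display for CalibrationError {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        match self {
            CalibrationError::EmptySample => write!(f, "calibration sample is empty"),
            CalibrationError::ZeroRange => write!(f, "calibration sample has zero range"),
        }
    }
}

impl std::error::Error for CalibrationError {}

/// Hypothetical quantization codec that needs calibration before use.
pub struct QuantCodec {
    pub scale: f32,
    pub zero_point: i32,
}

impl Bootstrap for QuantCodec {
    type Error = CalibrationError;

    // Override: compute (scale, zero_point) before encode / decode run.
    fn bootstrap(&mut self, ctx: &mut BootstrapCtx) -> Result<(), Self::Error> {
        let sample = &ctx.calibration_sample;
        if sample.is_empty() {
            return Err(CalibrationError::EmptySample);
        }
        let min = sample.iter().copied().fold(f32::INFINITY, f32::min);
        let max = sample.iter().copied().fold(f32::NEG_INFINITY, f32::max);
        if max <= min {
            return Err(CalibrationError::ZeroRange);
        }
        // Assumed scheme: map [min, max] onto the 256 levels of a u8.
        self.scale = (max - min) / 255.0;
        self.zero_point = (-min / self.scale).round() as i32;
        Ok(())
    }
}

/// Purely reactive component: keeps the default no-op.
pub struct MeanAggregator;

impl Bootstrap for MeanAggregator {
    type Error = std::convert::Infallible;
}
```

The stateless aggregator needs only the associated `Error` type here; in the framework the derive emits even that default impl, so such a component carries no bootstrap code at all.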

**DSL note.** A Component-level `Bootstrap` override is **not**
the same as a `Module::bootstrap` recording. The former runs
Rust code once when the host fires the slot (no DSL recording,
no FunctionProto); the latter records a `__bootstrap`
FunctionProto whose body ops dispatch as normal Contract methods.
Modules that need *graph-expressed* one-shot setup (e.g.
`Index::train(g, samples)`) record it inside `Module::bootstrap`;
Components that need *Rust-expressed* one-shot setup (e.g. mmap
a file) implement the `Bootstrap` Contract. The two paths
coexist — a Module bootstrap that calls `Index::train` dispatches
through the Index's Contract methods, which run only after the
Index's `Bootstrap` override completes (the seed order is
host-driven).

See [ENGINE.md §6.8](ENGINE.md#68-host-driven-bootstrap-entry)
for the engine plumbing,
[CONTRACT_DISPATCH.md](CONTRACT_DISPATCH.md#bootstrap-is-just-another-contract-method)
for the derive bridge, and
[AUTHORING_COMPONENTS.md](AUTHORING_COMPONENTS.md#authoring-a-component-level-bootstrap)
for an authoring walkthrough.

## Cross-references

- [API_DESIGN.md](API_DESIGN.md): Module → Compiler → Node
  three-phase construction.
- [AUTHORING_COMPONENTS.md](AUTHORING_COMPONENTS.md): long-form
  walkthrough of writing a concrete component.
- [COMPILER.md](COMPILER.md): compilation pipeline (18 passes) that
  binds recorded role-op `NodeProto`s to the concrete impls bound
  via the compiler chain.
- [CONTRACT_DISPATCH.md](CONTRACT_DISPATCH.md): Contract-method
  dispatch and the `dispatch_atomic` bridge design.
- [IR_AND_DSL.md](IR_AND_DSL.md): DSL → ONNX `ModelProto`, role
  opset catalog, and the per-op IO contracts.
- [WIRE.md](WIRE.md): wire envelope and protocol authoring.