ferrompi 0.4.1

A safe, generic Rust wrapper for MPI with support for MPI 4.0+ features, shared memory windows, and hybrid MPI+OpenMP
# ferrompi Architecture

## Overview

This document is the canonical reference for ferrompi's internal architecture.
Its audience is contributors investigating implementation details, not end users
seeking usage guidance (which lives in `README.md`). It explains why a
hand-written C wrapper exists, how handle tables are organised and made
thread-safe, which invariants must be preserved when adding new MPI entry
points, and how the sealed-trait type system enforces datatype safety across
the Rust/C boundary.

## Layer Diagram

ferrompi sits between application Rust code and an underlying MPI runtime
through a deliberate six-layer stack. Each layer has a clearly bounded
responsibility; crossing a layer boundary requires following the invariants
described in the sections below.

```mermaid
graph TD
    A["Rust application code\n(user crate, examples/)"]
    B["Public ferrompi API\n(src/lib.rs, src/comm/*, src/window.rs,\nsrc/group.rs, src/datatype_builder.rs, src/op.rs)"]
    C["FFI declarations\n(src/ffi.rs — extern \"C\" blocks)"]
    D["C wrapper layer\n(csrc/ferrompi.c — 4629 LOC\ncsrc/ferrompi.h — 2171 LOC)"]
    E["MPI implementation\n(MPICH / Open MPI / Cray MPT)"]
    F["MPI runtime\n(process manager, network fabric, RDMA HW)"]

    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
```

| Layer                    | Responsibility                                                                                                 |
| ------------------------ | -------------------------------------------------------------------------------------------------------------- |
| Rust application         | Calls ferrompi's public API; owns all live data.                                                               |
| Public ferrompi API      | Type safety, RAII drop semantics, `Result` error mapping, sealed-trait enforcement.                            |
| `src/ffi.rs`             | Thin `extern "C"` declarations — no logic, only signatures.                                                    |
| `csrc/ferrompi.c` / `.h` | Handle tables, large-count branching, `MPI_UNDEFINED` normalisation, op trampolines, runtime constant queries. |
| MPI implementation       | Collective algorithms, point-to-point transport, window coherence.                                             |
| MPI runtime              | Process launch, rank assignment, network fabric, memory registration.                                          |

The split between the C layer and the public Rust API is deliberate and
explained in full in `adr/0001-why-c-wrapper.md`. The short answer: several MPI
idioms (opaque handle tables, `_c` large-count branching, implementation-defined
constant values) require C11 constructs that cannot be expressed portably in
Rust FFI without replicating the C layer anyway.

## Handle Tables

The C layer in `csrc/ferrompi.c` owns seven fixed-size static tables that map
compact integer handles — passed over the FFI boundary as `int32_t` or
`int64_t` — to the opaque MPI handle types that the MPI library returns from
its constructor functions. MPI handles (`MPI_Comm`, `MPI_Request`, etc.) are
not stable integers; their values are implementation-defined and may change
across library versions. Storing them in a table and exposing only the table
index to Rust insulates the Rust layer from this instability.

| Table            | Capacity | Element type   | Occupancy tracking               |
| ---------------- | -------- | -------------- | -------------------------------- |
| `comm_table`     | 256      | `MPI_Comm`     | `MPI_COMM_NULL` sentinel         |
| `request_table`  | 16384    | `MPI_Request`  | `atomic_int request_used[16384]` |
| `win_table`      | 256      | `MPI_Win`      | `int win_used[256]`              |
| `info_table`     | 64       | `MPI_Info`     | `int info_used[64]`              |
| `group_table`    | 64       | `MPI_Group`    | `int group_used[64]`             |
| `datatype_table` | 64       | `MPI_Datatype` | `int datatype_used[64]`          |
| `op_table`       | 16       | `MPI_Op`       | `atomic_int op_used[16]`         |

Each table has a corresponding `alloc_*` / `free_*` / `get_*` function triple.
Allocation scans from a cached hint index and returns the first available slot;
freeing clears the occupancy marker and resets the handle to its null sentinel.
The hint is advisory (not correctness-critical) and advances after each
successful allocation to amortise scan cost.
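
A minimal Rust model of the hint-based allocation described above may help when reading the C code. It is a sketch only: the real tables live in `csrc/ferrompi.c`, and `HandleTable`, `CAPACITY`, and the `u64` payload are illustrative names, not ferrompi identifiers.

```rust
/// Hypothetical fixed-capacity table modelled on the description above.
const CAPACITY: usize = 64;

pub struct HandleTable {
    slots: [Option<u64>; CAPACITY], // `u64` stands in for an opaque MPI handle
    hint: usize,                    // advisory start index for the next scan
}

impl HandleTable {
    pub fn new() -> Self {
        Self { slots: [None; CAPACITY], hint: 0 }
    }

    /// Scan from the hint, wrap around once, and claim the first free slot.
    pub fn alloc(&mut self, value: u64) -> Option<i32> {
        for offset in 0..CAPACITY {
            let idx = (self.hint + offset) % CAPACITY;
            if self.slots[idx].is_none() {
                self.slots[idx] = Some(value);
                self.hint = (idx + 1) % CAPACITY; // advance past the claimed slot
                return Some(idx as i32);
            }
        }
        None // table full
    }

    /// Resolve a handle; negative, out-of-range, and free slots all yield `None`.
    pub fn get(&self, handle: i32) -> Option<u64> {
        if handle < 0 {
            return None;
        }
        self.slots.get(handle as usize).copied().flatten()
    }

    /// Clear the slot and return the stored value if it was occupied.
    pub fn free(&mut self, handle: i32) -> Option<u64> {
        if handle < 0 {
            return None;
        }
        self.slots.get_mut(handle as usize)?.take()
    }
}
```

Because the hint only chooses where the scan starts, a stale hint costs extra iterations but never hands out an occupied slot.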

### Request table — C11 atomic slot-claim

The `request_table` is by far the most active table: every non-blocking send,
receive, and collective operation allocates a slot. Under
`MPI_THREAD_MULTIPLE`, multiple threads may call `ferrompi_isend`,
`ferrompi_irecv`, and the wait/test family concurrently. A plain
read-modify-write on `request_used` would be a data race, allowing two threads
to silently claim the same slot.

`alloc_request` uses `atomic_compare_exchange_strong_explicit` with
`memory_order_acq_rel` on success: exactly one thread wins the CAS per slot per
attempt; the loser advances to the next index. `free_request` uses
`atomic_store_explicit` with `memory_order_release`; `get_request_ptr` uses
`atomic_load_explicit` with `memory_order_acquire`. This acquire/release pairing
ensures a thread that observes `request_used[i] != 0` also observes the
`request_table[i]` value written by the allocating thread. The full analysis,
including rejection of a pthread-mutex approach and a lock-free Treiber stack,
is in `adr/0002-handle-tables.md`.
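
The slot-claim protocol can be approximated in Rust with `std::sync::atomic`, whose orderings mirror the C11 ones named above. This is a sketch under assumptions: `RequestTable`, `MAX_REQUESTS`, and `claim_concurrently` are hypothetical, the table is shrunk to eight slots, and only the occupancy flags are modelled, not the stored `MPI_Request` values.

```rust
use std::sync::atomic::{AtomicI32, Ordering};

const MAX_REQUESTS: usize = 8; // illustrative; the real table holds 16384 slots

pub struct RequestTable {
    used: [AtomicI32; MAX_REQUESTS], // 0 = free, 1 = claimed
}

impl RequestTable {
    pub fn new() -> Self {
        Self { used: std::array::from_fn(|_| AtomicI32::new(0)) }
    }

    /// Claim the first free slot; exactly one thread can win each CAS.
    pub fn alloc(&self) -> Option<usize> {
        for (idx, flag) in self.used.iter().enumerate() {
            if flag
                .compare_exchange(0, 1, Ordering::AcqRel, Ordering::Relaxed)
                .is_ok()
            {
                return Some(idx);
            }
            // Slot taken (or CAS lost): advance to the next index.
        }
        None
    }

    /// Acquire load, pairing with the release store in `free`.
    pub fn is_used(&self, idx: usize) -> bool {
        self.used[idx].load(Ordering::Acquire) != 0
    }

    /// Release store that marks the slot free again.
    pub fn free(&self, idx: usize) {
        self.used[idx].store(0, Ordering::Release);
    }
}

/// Claim one slot from each of `threads` threads and collect the indices.
pub fn claim_concurrently(table: &RequestTable, threads: usize) -> Vec<usize> {
    std::thread::scope(|s| {
        let handles: Vec<_> = (0..threads).map(|_| s.spawn(|| table.alloc())).collect();
        handles.into_iter().filter_map(|h| h.join().unwrap()).collect()
    })
}
```

Calling `claim_concurrently` with as many threads as slots should return every index exactly once, which is the property the CAS is there to guarantee.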

The same C11 atomic pattern applies to `op_table` (parallel `atomic_int
op_used[MAX_OPS]` with the same CAS strategy), because user-defined ops may
also be created concurrently under `MPI_THREAD_MULTIPLE`.

### Slot 0 reservations and drop guards

`group_table[0]` is reserved for `MPI_GROUP_EMPTY` and is initialised once in
`init_tables`. The `Drop` implementation for `Group` guards `handle > 0` before
calling `ferrompi_group_free`, preventing a free of the reserved slot.

`datatype_table` has no reserved slot; `Drop` for `CustomDatatype` guards
`handle >= 0`. `op_table` slot allocation begins at index 0 with no reservation;
`Drop` for `UserOp` guards `op_handle >= 0` and calls `MPI_Op_free` before
clearing the slot (see "Thread-Safety Model" below for the ordering requirement).
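
The `handle > 0` guard for `Group` can be illustrated as follows. The sketch replaces the FFI call with a counting stand-in (`fake_group_free`), and the `Group` struct here is a simplified, hypothetical shape rather than ferrompi's actual definition.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Stand-in for the `ferrompi_group_free` FFI call; it only counts invocations.
static GROUP_FREES: AtomicUsize = AtomicUsize::new(0);

fn fake_group_free(_handle: i32) {
    GROUP_FREES.fetch_add(1, Ordering::SeqCst);
}

/// Simplified stand-in for the real `Group` wrapper.
pub struct Group {
    handle: i32,
}

impl Drop for Group {
    fn drop(&mut self) {
        // Slot 0 holds the reserved MPI_GROUP_EMPTY entry and must never be
        // freed; negative handles are invalid and are skipped as well.
        if self.handle > 0 {
            fake_group_free(self.handle);
        }
    }
}

/// Drop one `Group` per handle and report how many frees were issued.
pub fn frees_after_dropping(handles: &[i32]) -> usize {
    let before = GROUP_FREES.load(Ordering::SeqCst);
    for &handle in handles {
        drop(Group { handle });
    }
    GROUP_FREES.load(Ordering::SeqCst) - before
}
```

Dropping handles `0`, `1`, `5`, and `-1` issues two frees in this model: the reserved slot and the invalid handle are both skipped.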

### Open hardening work

`win_table` and `info_table` currently track occupancy in plain `int` arrays,
and `comm_table` relies on a plain `MPI_COMM_NULL` sentinel check; all three
allocate with a non-atomic read-modify-write. They therefore have the same
structural data race under `MPI_THREAD_MULTIPLE` that was fixed in the request
table. Hardening them follows the ticket-023 pattern documented in
`adr/0002-handle-tables.md` and is tracked as an open issue in
`plans/ferrompi-gap-closure/learnings/epic-07-summary.md`.

## Thread-Safety Model

ferrompi maps its `ThreadLevel` enum to MPI's four standard thread levels:

| `ThreadLevel` variant | MPI constant            | Meaning                                                     |
| --------------------- | ----------------------- | ----------------------------------------------------------- |
| `Single`              | `MPI_THREAD_SINGLE`     | Only one thread in the process.                             |
| `Funneled`            | `MPI_THREAD_FUNNELED`   | Only the thread that called `MPI_Init_thread` may call MPI. |
| `Serialized`          | `MPI_THREAD_SERIALIZED` | Multiple threads may call MPI, but not concurrently.        |
| `Multiple`            | `MPI_THREAD_MULTIPLE`   | Multiple threads may call MPI concurrently.                 |

`ThreadLevel` is selected at init time via `Mpi::init_thread(level)`. The
requested level is advisory; the MPI implementation may grant a lower level,
which ferrompi reflects in the returned `ThreadLevel`.
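
Because the granted level may be lower than the requested one, callers usually compare the two before spawning threads that touch MPI. The sketch below assumes the four variants are declared in ascending order and derive `PartialOrd`; whether ferrompi's `ThreadLevel` derives the comparison traits is an assumption, and `is_sufficient` is a hypothetical helper.

```rust
/// Local stand-in for ferrompi's `ThreadLevel`, declared in ascending order so
/// that the derived ordering matches MPI's notion of "at least".
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum ThreadLevel {
    Single,
    Funneled,
    Serialized,
    Multiple,
}

/// Hypothetical helper: does the granted level satisfy what the caller needs?
pub fn is_sufficient(granted: ThreadLevel, required: ThreadLevel) -> bool {
    granted >= required
}
```

An application that needs concurrent MPI calls would check `is_sufficient(granted, ThreadLevel::Multiple)` and fall back to a single communication thread otherwise.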

### `Communicator` is `Send + Sync`

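`Communicator` implements both `Send` and `Sync`. The handle it stores is an
integer index into `comm_table`, not the `MPI_Comm` itself; it carries no
thread affinity, so moving or sharing a `Communicator` across threads is sound.
Which threads may actually call MPI through it depends on the granted thread
level: under `MPI_THREAD_FUNNELED` only the initialising thread may, under
`MPI_THREAD_SERIALIZED` any thread may as long as calls do not overlap, and
under `MPI_THREAD_MULTIPLE` any thread may call any `Communicator` method
concurrently.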

### Synchronisation in the C layer under `Multiple`

Under `MPI_THREAD_MULTIPLE`, the slot-claim CAS in `alloc_request` (and the
parallel CAS in `alloc_op`) is the **only** synchronisation primitive in the C
wrapper layer. All other correctness guarantees rely on the MPI implementation's
own thread-safety contract: if MPI reports `MPI_THREAD_MULTIPLE`, it guarantees
that concurrent calls to its own functions on the same communicator are safe.
ferrompi does not add mutexes, condition variables, or barriers beyond the
handle-table atomics.

### `UserOp` closure thread-safety contract

`UserOp<T>` wraps a user-supplied Rust closure that MPI will invoke during
reduction operations. The closure is stored in the C-side `op_closure_data` /
`op_closure_vtbl` fat-pointer pair and may be called from any thread under
`MPI_THREAD_MULTIPLE` — including an MPI-internal thread-pool thread that the
application did not create.

The closure must satisfy `F: Fn(&[T], &mut [T]) + Send + Sync + 'static`. All
three bounds are mandatory and enforced at compile time (see
`adr/0005-mpi-op-create.md` Decision 2):

- `Send` — the closure is moved into a global static table accessible from any thread.
- `Sync` — concurrent invocations on the same `MPI_Op` (e.g., two concurrent `allreduce` calls on different communicators) must not produce data races on the closure's captured state.
- `'static` — the closure is stored in a raw pointer in the C static segment; any borrow shorter than `'static` could be invalidated while the `MPI_Op` handle is still live.

Captures using `Rc<T>`, `Cell<T>`, `RefCell<T>`, raw pointers, or non-`'static`
references are rejected at compile time. Callers requiring shared mutable
closure state must use `Arc<Mutex<T>>` or `Arc<RwLock<T>>`.
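
The effect of the three bounds can be seen in a small stand-alone sketch. `apply_op` and `counting_sum` are hypothetical helpers, not ferrompi APIs, and the element type is fixed to `i32` for brevity; the bounds themselves are the ones quoted above.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical function carrying the same closure bounds as `UserOp`,
/// with `T` fixed to `i32`.
pub fn apply_op<F>(op: &F, input: &[i32], inout: &mut [i32])
where
    F: Fn(&[i32], &mut [i32]) + Send + Sync + 'static,
{
    op(input, inout);
}

/// Element-wise sum that also counts its invocations through shared state.
/// `Arc<Mutex<_>>` keeps the closure `Send + Sync + 'static`.
pub fn counting_sum(
    calls: Arc<Mutex<u32>>,
) -> impl Fn(&[i32], &mut [i32]) + Send + Sync + 'static {
    move |input: &[i32], inout: &mut [i32]| {
        *calls.lock().unwrap() += 1;
        for (dst, src) in inout.iter_mut().zip(input) {
            *dst += *src;
        }
    }
}
```

Swapping `Arc<Mutex<u32>>` for `Rc<RefCell<u32>>` in `counting_sum` makes the return type fail to satisfy `Send + Sync`, which is the compile-time rejection described above.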

### Drop ordering for `UserOp`

`Drop for UserOp<T>` calls `MPI_Op_free` first, then releases the closure slot.
This ordering is mandatory: after `MPI_Op_free` returns, the MPI library
guarantees it will never invoke the user function pointer again; only then may
the closure storage be freed. Reversing the order — dropping the closure before
freeing the op — is a use-after-free if any in-flight collective is still
dispatching the trampoline. The full rationale is in `adr/0005-mpi-op-create.md`
Decision 3.
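
The ordering can be modelled without MPI by logging the two steps. In this sketch `ClosureSlot`, `Log`, and `drop_order` are hypothetical, the `UserOp` struct is a simplified stand-in, and the `"MPI_Op_free"` entry stands in for the real FFI call.

```rust
use std::sync::{Arc, Mutex};

type Log = Arc<Mutex<Vec<&'static str>>>;

/// Stand-in for the closure storage; logs when it is released.
struct ClosureSlot {
    log: Log,
}

impl Drop for ClosureSlot {
    fn drop(&mut self) {
        self.log.lock().unwrap().push("closure slot released");
    }
}

/// Simplified stand-in for `UserOp<T>`.
pub struct UserOp {
    log: Log,
    slot: Option<ClosureSlot>,
}

impl Drop for UserOp {
    fn drop(&mut self) {
        // Step 1: free the op so MPI stops invoking the trampoline.
        self.log.lock().unwrap().push("MPI_Op_free");
        // Step 2: only now release the closure storage.
        drop(self.slot.take());
    }
}

/// Drop one `UserOp` and return the recorded sequence of events.
pub fn drop_order() -> Vec<&'static str> {
    let log = Log::default();
    drop(UserOp {
        log: Arc::clone(&log),
        slot: Some(ClosureSlot { log: Arc::clone(&log) }),
    });
    let events = log.lock().unwrap().clone();
    events
}
```

Doing both steps explicitly inside `drop` makes the order visible in the code, rather than leaving it to field declaration order.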

## C Layer Scope

The C wrapper is intentionally narrow. The two lists below distinguish what
belongs in C from what belongs in Rust.

### What goes in C (`csrc/ferrompi.c`)

- **Handle tables** — `comm_table`, `request_table`, `win_table`, `info_table`,
  `group_table`, `datatype_table`, `op_table` and their `alloc_*` / `free_*` /
  `get_*` triples. MPI opaque handles cannot be stored in Rust without copying
  the entire allocation strategy anyway.
- **Large-count branching** — shims for non-blocking and custom p2p operations
  inspect `MPI_VERSION` at compile time and dispatch to `MPI_*_c` variants
  (which accept `MPI_Count` rather than `int`) when `MPI_VERSION >= 4`. This
  branching requires C preprocessor guards that would be unreadable as inline
  Rust assembly or build-script code generation.
- **`MPI_UNDEFINED` normalisation** — MPI does not standardise the integer value
  of `MPI_UNDEFINED`. MPICH uses `-32766`; Open MPI uses `-1`. Every shim that
  may return `MPI_UNDEFINED` (e.g., `ferrompi_group_translate_ranks`) normalises
  it to `-1` before returning to Rust, so the Rust layer can use a single
  sentinel value.
- **`install_errors_return`** — called on every newly-created communicator handle
  to set the error handler to `MPI_ERRORS_RETURN`, converting MPI errors from
  process-aborting signals into return codes that ferrompi can translate to
  `Err(Error::Mpi { .. })`.
- **Op trampolines** — 16 distinct C functions `ferrompi_user_op_trampoline_0`
  through `ferrompi_user_op_trampoline_15`, generated by a preprocessor macro.
  Each bakes its slot index into the function body so that the closure pointer
  can be retrieved from `op_closure_data[N]` without a side channel. See
  `adr/0005-mpi-op-create.md` Decision 5 for why a single trampoline with
  thread-local dispatch is unsafe under `MPI_THREAD_MULTIPLE`.
- **Runtime implementation-defined constants** — `MPI_MODE_NOSTORE`,
  `MPI_MODE_NOPUT`, `MPI_MODE_NOPRECEDE`, `MPI_MODE_NOSUCCEED`, and the
  analogous PSCW assert constants are queried once via C shims and cached in
  Rust via `OnceLock<[i32; N]>`. Hardcoding them in Rust would be incorrect
  because their values are implementation-defined.
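
The baked-index idea behind the op trampolines can be illustrated in Rust, with const generics standing in for the C preprocessor macro. This is an analogy, not the implementation: the real trampolines are C functions with the MPI user-function signature, while `trampoline`, `install`, `SLOTS`, and the `i32 -> i32` closure type are invented here to keep the sketch small.

```rust
use std::sync::Mutex;

const MAX_OPS: usize = 4; // the real table has 16 slots

type Closure = Box<dyn Fn(i32) -> i32 + Send + Sync>;

// One closure slot per op handle, standing in for `op_closure_data[N]`.
const EMPTY: Mutex<Option<Closure>> = Mutex::new(None);
static SLOTS: [Mutex<Option<Closure>>; MAX_OPS] = [EMPTY; MAX_OPS];

/// Each instantiation has its slot index `N` fixed at compile time, so the
/// function needs no extra argument to find its closure.
extern "C" fn trampoline<const N: usize>(x: i32) -> i32 {
    let guard = SLOTS[N].lock().unwrap();
    let result = match guard.as_ref() {
        Some(closure) => closure(x),
        None => x, // empty slot: pass the value through unchanged
    };
    result
}

// Plain function pointers, one per slot, as an FFI layer would hand to C.
static TRAMPOLINES: [extern "C" fn(i32) -> i32; MAX_OPS] = [
    trampoline::<0>,
    trampoline::<1>,
    trampoline::<2>,
    trampoline::<3>,
];

/// Store a closure in `slot` and return the matching function pointer.
pub fn install(slot: usize, closure: Closure) -> extern "C" fn(i32) -> i32 {
    *SLOTS[slot].lock().unwrap() = Some(closure);
    TRAMPOLINES[slot]
}
```

Each returned pointer is a distinct function, so a callee that receives only the pointer (as MPI does) still reaches the right closure.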

### What stays in Rust (`src/`)

- **Type safety** — the sealed-trait families `MpiDatatype`,
  `AtomicMpiDatatype`, `MpiIndexedDatatype`, and `BytePermutable` (see
  "Generic-over-`MpiDatatype` Design") ensure that only valid Rust types reach
  MPI entry points. The C layer accepts raw integers and cannot enforce this.
- **RAII drop semantics** — `Communicator`, `Request`, `PersistentRequest`,
  `Group`, `CustomDatatype`, `Win`, `UserOp`, and the RMA lock guards all
  implement `Drop`. Callers cannot forget to free handles; the Rust borrow
  checker enforces lifetime containment.
- **`Error` mapping** — `Error::check_with_op(ret, "<tag>")` converts every
  non-zero MPI return code into a structured
  `Error::Mpi { class, code, message, operation }`. The `operation` tag is the C
  function name with the `ferrompi_` prefix stripped, enabling precise error
  attribution.
- **`OnceLock`-cached version probes** — MPI 4.0+ features (e.g.,
  `Mpi::create_from_group`) are gated by `OnceLock<bool>` probes that parse
  `Mpi::version()` at first call and cache the result. The probe lives in Rust
  because version queries are pure Rust logic; the underlying MPI version
  integer is available without FFI.
- **`catch_unwind + abort` panic fence** — the `rust_user_op_invoke`
  `extern "C"` entry point in `src/op.rs` wraps every closure invocation in
  `std::panic::catch_unwind`. If the closure panics, the process aborts
  immediately. Panicking across the FFI boundary is undefined behaviour; silent
  data corruption in a collective result is worse than a loud process abort for
  HPC use cases. See `adr/0005-mpi-op-create.md` Decision 6.
- **`WinLockGuard` / `WinLockAllGuard` RAII epoch guards** — passive-target
  RMA epochs (lock/unlock) are represented as RAII guards that carry `flush`
  and `flush_all` as inherent methods. This prevents calling `flush` outside a
  passive-target epoch, which is a misuse MPI does not check at runtime.
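
The panic fence from the list above follows a common pattern that can be sketched in isolation. `call_or_abort` and `fenced_add` are hypothetical names; the real `rust_user_op_invoke` has a different signature that is not reproduced here.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};
use std::process::abort;

/// Run `f`; if it panics, abort the process instead of unwinding further.
pub fn call_or_abort<R>(f: impl FnOnce() -> R) -> R {
    match catch_unwind(AssertUnwindSafe(f)) {
        Ok(value) => value,
        // Unwinding must not cross the C frames above us, so stop here.
        Err(_) => abort(),
    }
}

/// Hypothetical C-ABI entry point that routes user code through the fence.
pub extern "C" fn fenced_add(a: i32, b: i32) -> i32 {
    call_or_abort(|| a.checked_add(b).expect("overflow in user op"))
}
```

On the normal path the fence is transparent and simply returns the closure's result; only a panic reaches `abort`.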

The rationale for this split — rather than writing pure Rust FFI without any C
intermediary — is documented in `adr/0001-why-c-wrapper.md`.

## FFI / ABI Invariants

The following invariants are established across the seven epics and must be
preserved by every new MPI entry point added to ferrompi.

### `#[repr(i32)]` enums with explicit discriminants

`ReduceOp` and `DatatypeTag` carry `#[repr(i32)]` with explicit `= N`
discriminants. These integers cross the FFI boundary raw (cast with `op as i32`
or `tag as i32`) and are decoded by `get_op()` and `get_datatype()` in
`csrc/ferrompi.c`. The discriminant values are a **semver contract**: renumbering
any variant silently corrupts any caller that stored the integer across a
version boundary. New variants must append at the next free integer; gaps are
forbidden.
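
A sketch of the discriminant contract, using made-up variant values: the real `ReduceOp` numbering is defined in ferrompi's source and is not reproduced here, and `decode_op` is a hypothetical Rust mirror of the C-side `get_op()` switch.

```rust
/// Illustrative enum; the variant set and values are placeholders.
#[repr(i32)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReduceOp {
    Sum = 0,
    Max = 1,
    Min = 2,
    Prod = 3, // new variants append at the next free integer
}

/// Hypothetical decoder mirroring what a C `switch` on the tag would do.
pub fn decode_op(raw: i32) -> Option<ReduceOp> {
    match raw {
        0 => Some(ReduceOp::Sum),
        1 => Some(ReduceOp::Max),
        2 => Some(ReduceOp::Min),
        3 => Some(ReduceOp::Prod),
        _ => None, // unknown tag: reject rather than guess
    }
}
```

The round trip `decode_op(op as i32) == Some(op)` holds only while both sides agree on the numbering, which is why renumbering is treated as a breaking change.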

### Unconditional C switch

The C switch that decodes op and datatype tags is always compiled
unconditionally, with no `#ifdef` guards. This is required because the Rust
layer may legally produce any tag value that the `ReduceOp` or `DatatypeTag`
enum can represent, including variants that are only reachable when specific
Cargo features (`rma`) are enabled. The C layer must accept all of them.

### Six-layer entry-point pattern

Every new MPI entry point follows this pattern in order:

1. C declaration in `csrc/ferrompi.h`.
2. C implementation in `csrc/ferrompi.c`.
3. `extern "C"` declaration in `src/ffi.rs`.
4. Safe Rust wrapper in the appropriate `src/` module.
5. Integration example in `examples/`.
6. `run_test` line in `tests/run_mpi_tests.sh`.

Omitting any layer is a scope violation. The test runner entry is particularly
easy to forget; it must be declared explicitly in each ticket's "Key Files to
Modify" list.

### `Error::check_with_op` at every call site

Every `ferrompi_*` FFI result must pass through
`Error::check_with_op(ret, "<tag>")`, where `<tag>` is the C function name with
the `ferrompi_` prefix stripped. Bare `Error::check(ret)` calls are forbidden in
production code and were eliminated in epic-05 (101 migration sites).

### `install_errors_return` on comm-creating shims

Every shim that creates a new communicator (or may return one from MPI) calls
`install_errors_return(newcomm)` before returning the handle to Rust. This sets
`MPI_ERRORS_RETURN` as the error handler, converting MPI library aborts into
return codes.

### Two-phase comm handle guard

Every shim that takes a `comm_handle` parameter applies a two-phase guard:
first `if (comm_handle < 0) return MPI_ERR_COMM;`, then
`if (comm == MPI_COMM_NULL) return MPI_ERR_COMM;` after resolving through
`comm_table`. This defends against both invalid indices and handles that were
freed but whose slot was not yet reclaimed.

### `MPI_UNDEFINED` normalised to `-1`

Every shim that may return `MPI_UNDEFINED` normalises it to `-1` before
returning to Rust. The Rust layer uses `-1` as the sentinel; `MPI_UNDEFINED`'s
value is implementation-defined (MPICH: `-32766`; Open MPI: `-1`).
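
The normalisation and its consumption on the Rust side can be sketched as two small functions. Both are hypothetical: `normalise_undefined` models in Rust what the C shim does, with the implementation's `MPI_UNDEFINED` value passed in as a parameter, and `rank_from_shim` shows one way a Rust caller might turn the sentinel into an `Option`.

```rust
/// Sentinel the Rust layer sees, per the invariant above.
pub const UNDEFINED_SENTINEL: i32 = -1;

/// Model of the C-side step: map the implementation's `MPI_UNDEFINED`
/// (supplied by the caller in this sketch) to the fixed sentinel.
pub fn normalise_undefined(raw: i32, mpi_undefined: i32) -> i32 {
    if raw == mpi_undefined {
        UNDEFINED_SENTINEL
    } else {
        raw
    }
}

/// Hypothetical Rust-side helper: treat the sentinel as "no rank".
pub fn rank_from_shim(normalised: i32) -> Option<i32> {
    if normalised == UNDEFINED_SENTINEL {
        None
    } else {
        Some(normalised)
    }
}
```

With the MPICH value quoted above, `normalise_undefined(-32766, -32766)` yields `-1`, which `rank_from_shim` then maps to `None`.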

### `MPI_MODE_*` constants queried at runtime

RMA fence and PSCW assert constants (`MPI_MODE_NOSTORE`, `MPI_MODE_NOPUT`,
`MPI_MODE_NOPRECEDE`, `MPI_MODE_NOSUCCEED`, and their PSCW equivalents) are
never hardcoded in Rust. They are queried once via C shims at first use and
cached via `OnceLock<[i32; N]>` in `src/window.rs`. This is required because
their values are implementation-defined.
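
The query-once-then-cache behaviour maps directly onto `std::sync::OnceLock`. In this sketch `query_mode_constants` is a stand-in for the C shims and returns placeholder values, and `mode_constants` and `query_count` are hypothetical helpers; only the `OnceLock<[i32; N]>` shape is taken from the text above.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

// Counts how often the (fake) C shims are queried.
static QUERIES: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for the C shims; the values are placeholders, not real constants.
fn query_mode_constants() -> [i32; 4] {
    QUERIES.fetch_add(1, Ordering::SeqCst);
    [1, 2, 4, 8]
}

static MODE_CONSTANTS: OnceLock<[i32; 4]> = OnceLock::new();

/// First call runs the query; later calls return the cached array.
pub fn mode_constants() -> &'static [i32; 4] {
    MODE_CONSTANTS.get_or_init(query_mode_constants)
}

/// Number of times the query function has actually run.
pub fn query_count() -> usize {
    QUERIES.load(Ordering::SeqCst)
}
```

`get_or_init` also guarantees that concurrent first calls run the initialiser only once, so no extra locking is needed around the cache.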

## Generic-over-`MpiDatatype` Design

All ferrompi communication APIs are generic over the element type `T`. Four
sealed-trait families in `src/datatype.rs` define which Rust types are valid
for which MPI operations.

| Trait                | Sealed module        | Types                                                                   | Operations                                                        |
| -------------------- | -------------------- | ----------------------------------------------------------------------- | ----------------------------------------------------------------- |
| `MpiDatatype`        | `mod sealed`         | `f32`, `f64`, `i32`, `i64`, `u8`, `u32`, `u64`                          | All point-to-point, collectives, RMA                              |
| `AtomicMpiDatatype`  | `mod sealed_atomic`  | `i32`, `i64`, `u32`, `u64`, `u8`                                        | RMA atomic ops (`compare_and_swap`, `fetch_and_op`, `accumulate`) |
| `MpiIndexedDatatype` | `mod sealed_indexed` | `FloatInt`, `DoubleInt`, `LongInt`, `Int2`, `ShortInt`, `LongDoubleInt` | `allreduce_indexed` (MaxLoc/MinLoc)                               |
| `BytePermutable`     | `mod sealed_byte`    | `u8`, `u16`, `u32`, `u64`, `i8`, `i16`, `i32`, `i64`, `[T; N]`          | `allreduce_bytes` (bitwise ops only)                              |

Each trait is sealed through its own private module (listed in the table
above) containing a `Sealed` marker trait. External crates cannot name, and
therefore cannot implement, that marker, so they cannot add new `MpiDatatype`
(or any other) implementations. Invalid element types are rejected at compile
time rather than at runtime.

`AtomicMpiDatatype` excludes `f32` and `f64`. The MPI standard defines
`MPI_Compare_and_swap` only for integer, logical, and byte datatypes, not
floating-point ones; admitting `f64` into `Win::compare_and_swap` would
silently produce undefined behaviour on most MPI implementations. A
`compile_fail` doctest in `src/datatype.rs` verifies that
`Win<f64>::compare_and_swap` is rejected by the compiler.

Each trait carries a `const TAG: DatatypeTag` associated constant (for
`MpiDatatype`) or an equivalent tag mechanism, allowing the C layer to identify
the element type at runtime via the stable `#[repr(i32)]` discriminant. At
runtime, the `debug_assert_eq!(T::TAG as i32, mapped_tag(*dt))` assertion in
`UserOp` trampolines detects mismatches in debug builds without a hot-path
branch in release builds.
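
The sealing and tagging described in this section combine roughly as follows. The sketch covers two element types only, the `DatatypeTag` values are placeholders, and `tag_of` is a hypothetical helper; ferrompi's real definitions live in `src/datatype.rs`.

```rust
mod sealed {
    // Public trait inside a private module: nameable here, not outside.
    pub trait Sealed {}
    impl Sealed for f64 {}
    impl Sealed for i32 {}
}

/// Illustrative tag enum; discriminant values are placeholders.
#[repr(i32)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DatatypeTag {
    F64 = 1,
    I32 = 2,
}

/// Only types that already implement `sealed::Sealed` can implement this.
pub trait MpiDatatype: sealed::Sealed + Copy {
    const TAG: DatatypeTag;
}

impl MpiDatatype for f64 {
    const TAG: DatatypeTag = DatatypeTag::F64;
}

impl MpiDatatype for i32 {
    const TAG: DatatypeTag = DatatypeTag::I32;
}

/// Hypothetical helper: the integer that would cross the FFI boundary.
pub fn tag_of<T: MpiDatatype>(_buffer: &[T]) -> i32 {
    T::TAG as i32
}
```

A call such as `tag_of(&["text"])` does not compile, because `&str` has no `Sealed` implementation and a downstream crate cannot add one.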

The authoritative design record for this trait family is
`adr/0003-generic-mpi-datatype.md`.

## Error Handling Model

All fallible ferrompi operations return `Result<T, Error>`. The `Error` type
has two groups of variants: `Error::Mpi` for errors originating from the MPI
library, and validation variants (`Error::InvalidOp`, `Error::InvalidBuffer`,
etc.) for errors detected in Rust before any MPI call.

### `Error::Mpi` structure

`Error::Mpi` carries four fields:

| Field       | Type                   | Source                                                        |
| ----------- | ---------------------- | ------------------------------------------------------------- |
| `class`     | `MpiErrorClass`        | `MPI_Error_class(code)` — normalised to a stable Rust enum.   |
| `code`      | `i32`                  | Raw MPI error code as returned by the failing function.       |
| `message`   | `String`               | `MPI_Error_string(code)` — human-readable description.        |
| `operation` | `Option<&'static str>` | The C function name (without `ferrompi_` prefix) that failed. |

Pattern matches on `Error::Mpi` must bind all four fields or use `..`.
This is a compiler-enforced exhaustiveness requirement introduced as a breaking
change in version 0.4.0. Using `_` to suppress the `operation` field silently
discards attribution information; the correct pattern for ignoring a field is
`..`.

### `Error::check_with_op` pattern

```rust,ignore
Error::check_with_op(ret, "allreduce")?;
```

This is the standard idiom at every FFI call site. It returns `Ok(())` when
`ret == MPI_SUCCESS` and constructs `Err(Error::Mpi { operation: Some("allreduce"), .. })`
otherwise. The tag string must match the C function name with `ferrompi_`
stripped; a grep over `csrc/ferrompi.c` confirms each tag.
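
A simplified model of the idiom, for orientation only. The real `Error::check_with_op` obtains `class` and `message` from `MPI_Error_class` and `MPI_Error_string`; this sketch fills them with placeholders, uses `i32` in place of `MpiErrorClass`, and adds a hypothetical `failing_operation` helper to show the `..` pattern.

```rust
const MPI_SUCCESS: i32 = 0;

/// Cut-down stand-in for ferrompi's `Error`.
#[derive(Debug, PartialEq)]
pub enum Error {
    Mpi {
        class: i32, // `MpiErrorClass` in ferrompi; simplified here
        code: i32,
        message: String,
        operation: Option<&'static str>,
    },
}

impl Error {
    /// Return `Ok(())` on success, otherwise an `Error::Mpi` carrying the tag.
    pub fn check_with_op(ret: i32, operation: &'static str) -> Result<(), Error> {
        if ret == MPI_SUCCESS {
            return Ok(());
        }
        Err(Error::Mpi {
            class: ret, // placeholder: the real code queries MPI for the class
            code: ret,
            message: format!("MPI error code {}", ret), // placeholder text
            operation: Some(operation),
        })
    }
}

/// Extract the attribution tag, ignoring the other fields with `..`.
pub fn failing_operation(result: &Result<(), Error>) -> Option<&'static str> {
    match result {
        Ok(()) => None,
        Err(Error::Mpi { operation, .. }) => *operation,
    }
}
```

Because the tag is a `&'static str`, attaching it costs nothing on the success path beyond passing a pointer.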

### `Display` implementation

`Display` for `Error` is hand-rolled in `src/error.rs`. The `thiserror` crate
was removed from the dependency list in version 0.4.0 because `operation`
requires conditional formatting that `#[derive(thiserror::Error)]` cannot
express cleanly. The `[dependencies]` table in `Cargo.toml` is currently empty
as a result.
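
The conditional formatting mentioned above might look like the following hand-written `Display`. The message layout is invented for illustration and is not ferrompi's actual output; the enum is a cut-down stand-in with `i32` in place of `MpiErrorClass`.

```rust
use std::fmt;

/// Cut-down stand-in for ferrompi's `Error`.
#[derive(Debug)]
pub enum Error {
    Mpi {
        class: i32,
        code: i32,
        message: String,
        operation: Option<&'static str>,
    },
    InvalidBuffer,
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // Include the operation only when one was recorded.
            Error::Mpi { code, message, operation: Some(op), .. } => {
                write!(f, "MPI error in {}: {} (code {})", op, message, code)
            }
            Error::Mpi { code, message, operation: None, .. } => {
                write!(f, "MPI error: {} (code {})", message, code)
            }
            Error::InvalidBuffer => write!(f, "invalid buffer"),
        }
    }
}

// With `Debug` and `Display` in place, the std error trait needs no methods.
impl std::error::Error for Error {}
```

Splitting the `Some`/`None` cases into separate match arms keeps each format string fixed, which is the part that is awkward to express with a single derive attribute.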

## Cross-References to ADRs

Five Architecture Decision Records provide the authoritative rationale for
ferrompi's major design choices. These records are immutable once accepted;
if a decision changes, a new ADR supersedes the old one.

- `adr/0001-why-c-wrapper.md` — Why ferrompi uses a hand-written C shim layer
  rather than direct Rust FFI against `libmpi.so`. Covers handle table
  portability, `_c` large-count dispatch, and `MPI_UNDEFINED` normalisation.
- `adr/0002-handle-tables.md` — Concurrency strategy for the request table
  under `MPI_THREAD_MULTIPLE`. Justifies C11 atomics with `atomic_compare_exchange_strong_explicit`
  over a pthread mutex (rejected: serialises the hot read path) and a lock-free
  Treiber stack (rejected: ABA complexity, Cray toolchain gaps).
- `adr/0003-generic-mpi-datatype.md` — Design of the sealed-trait type family
  (`MpiDatatype`, `MpiIndexedDatatype`, `BytePermutable`, `AtomicMpiDatatype`)
  and the `#[repr(i32)]` discriminant ABI contract for `DatatypeTag`.
- `adr/0004-persistent-collective-approach.md` — How ferrompi exposes MPI 4.0
  persistent collective operations (`*_init` / `start` / `wait` lifecycle),
  including the decision to omit `_c` large-count variants from persistent
  shims.
- `adr/0005-mpi-op-create.md` — Seven decisions covering the
  `MPI_Op_create` trampoline safety model: closure storage (per-op static slot
  table), `Send + Sync + 'static` bounds, `MPI_Op_free`-before-slot-release drop
  ordering, default commutativity, per-slot baked-index C trampolines, and
  `catch_unwind + abort` panic handling.