tokitai-operator 0.1.0

Verified DL kernel compiler: formally-checked GEMM, p-adic, sheaf, contract-carrying ops. Paper-artifact grade.
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
# Tokitai Operator Architecture

This document sketches the architecture for a Rust operator system whose core
abstractions can serve both conventional tensor workloads and less common
mathematical domains such as sheaves, p-adic fields, non-Archimedean norms, and
structured algebraic spaces.

The intended bar is higher than "a tensor library with extra dtypes". Tokitai
should become a mathematical operator compiler: a system that uses algebraic
laws, locality, valuation, precision, and representation constraints as
optimization inputs. Tensor scheduling, p-adic precision control, and
sheaf-theoretic local-to-global decomposition should be expressed through the
same planning machinery.

The guiding principle is separation of concerns:

- Mathematical meaning is defined independently of storage, scheduling, and
  hardware execution.
- Numeric domains are first-class objects, not extensions of floating point
  dtypes.
- Operators describe valid transformations between typed mathematical objects.
- Rewrites carry evidence explaining why they are valid.
- Precision, locality, and valuation are schedulable resources.
- Backends decide how a valid transformation is executed on a concrete device.

Post-submission SIMD/GPU backend exploration is scoped in
`docs/paper/post_submission_accelerators.md`. That plan keeps CPU reference execution as the current correctness baseline
and treats accelerated backends as future work behind explicit registry/lowering proof obligations.

## Goals

Tokitai should eventually support two classes of workloads:

1. High-performance tensor computation on ordinary numeric domains such as
   `f32`, `f64`, `bf16`, `fp8`, integer, boolean, and quantized formats.
2. Mathematically structured computation over domains such as finite fields,
   rings, p-adic fields, valued fields, sheaves over sites, chain complexes, and
   locally defined data with gluing constraints.

The architecture must avoid a common failure mode: designing a fast tensor
library first, then trying to attach richer mathematics through ad hoc metadata.
Instead, tensors should be one representation of a broader object system.

The "next-generation" claim depends on five concrete capabilities:

1. Contract-carrying operators: every operator exposes semantic requirements,
   invariants, laws, and precision behavior.
2. Evidence-carrying rewrites: every optimization records the law, theorem,
   approximation rule, or backend equivalence that justifies it.
3. Layered local-to-global execution: computation can be split across covers,
   strata, precision layers, memory hierarchy, and hardware hierarchy using one
   planner vocabulary.
4. Adaptive mathematical scheduling: the planner can choose not only tile sizes
   and kernels, but also algebraic algorithms, precision schedules, covers,
   lifting strategies, and representations.
5. Domain-aware verification: optimized plans are checked against declared laws,
   reference semantics, and domain-specific equality.

This positions Tokitai closer to a proof-aware mathematical runtime than a
conventional eager tensor package.

## Research Positioning

Tokitai should learn from existing systems but not copy their boundaries:

- TVM, TensorIR, Ansor, Triton, and related systems show how tensor programs can
  be represented, scheduled, searched, and lowered efficiently.
- MLIR and IREE show how layered IRs and target-specific lowering can scale
  across backends.
- FlashAttention shows that new operators emerge when the memory model is part
  of the algorithm, not merely an implementation detail.
- Superoptimization work such as Mirage points toward multi-level search across
  graph, operator, and kernel representations.
- Computer algebra systems and proof assistants show the importance of laws,
  exactness, and constructive evidence, but they usually do not target adaptive
  high-performance execution.

Tokitai's intended gap is the intersection: high-performance operator planning
where algebraic structure, topology, valuation, and precision are explicit
optimization dimensions.

## Pain Points In Current Operator Systems

The strongest current systems provide useful lessons, but they expose recurring
pain points that Tokitai should avoid by design.

### 1. Graph Capture Is Fragile

PyTorch 2.x improves eager programs through graph capture and compilation, but
graph breaks remain a major performance and usability issue. Data-dependent
Python control flow, unsupported Python functions, C extensions, and custom
logic can fragment a program into multiple compiled graphs. Once the graph is
fragmented, fusion and global optimization opportunities are lost.

Tokitai design response:

- Use an explicit semantic builder for core programs instead of relying only on
  tracing host-language behavior.
- Treat non-capturable boundaries as typed effects, not silent graph breaks.
- Make partial compilation explicit in the IR and expose optimization loss.
- Provide escape hatches through custom operators with declared contracts.

### 2. Dynamic Shape Support Is Still A Core Weakness

Many compiler stacks still specialize heavily on static shapes. Dynamic shapes
cause recompilation, reduce optimization quality, or require padding and
bucketization. Bounded dynamic shapes help, but they do not solve all cases,
especially data-dependent output shapes.

Tokitai design response:

- Represent symbolic, bounded, and data-dependent shapes in `ObjectMeta`.
- Make shape constraints first-class planner facts.
- Cache plans by shape classes and contracts rather than only concrete shapes.
- Separate semantic validity from specialization quality.
- Add explicit fallback paths for data-dependent shape operators.

### 3. JIT And Autotuning Costs Are Operationally Painful

Triton, Inductor, XLA, TVM MetaSchedule, and similar systems can produce strong
kernels, but compilation and tuning time can dominate production workflows.
Search-based tuning trades time for performance, and tuning records are often
shape-, fusion-, and backend-specific.

Tokitai design response:

- Support a tiered planner: heuristic baseline first, measured tuning later.
- Persist tuning records and rewrite evidence separately.
- Use transferable plan keys based on semantic structure, shape class, domain,
  layout, and backend capabilities.
- Make tuning budget explicit in the API.
- Prefer plans that are robust across shape buckets when latency matters.

### 4. Fusion Is Powerful But Pattern-Limited

Libraries such as oneDNN Graph, cuDNN Graph, CUTLASS epilogues, and compiler
fusers can remove memory traffic, but their fusion power is often constrained by
recognized patterns, layout assumptions, hardware generation, or epilogue
interfaces.

Tokitai design response:

- Model fusion as a contract-checked rewrite instead of a fixed pattern list.
- Represent producer-consumer relations, aliasing, layout, and precision
  obligations explicitly.
- Allow backend-specific fusion without losing the semantic audit trail.
- Support both graph-level fusion and local-to-global decomposition.

### 5. Performance Portability Remains Incomplete

High-performance kernels are usually tied to a narrow hardware family, memory
hierarchy, data type set, or layout. CUTLASS is excellent for NVIDIA GEMM-like
workloads, oneDNN is strong on CPU, XLA targets TPU well, and Triton is strong
for custom GPU kernels, but each stack has different limits and tuning
assumptions.

Tokitai design response:

- Keep backend capabilities explicit and queryable.
- Do not let a backend define operator semantics.
- Allow mixed execution and graceful fallback.
- Treat hardware hierarchy as one kind of layer in `LayerIR`.
- Keep representation selection independent from mathematical domain.

### 6. Debuggability And Correctness Are Weak Spots

Fused generated kernels are hard to inspect. Incorrect compiler output,
numerical drift, nondeterministic reductions, layout mistakes, and backend
fallbacks are often difficult to diagnose. Most systems optimize first and make
the user reconstruct why a transformation happened.

Tokitai design response:

- Every semantic rewrite must carry evidence.
- Every backend semantic degradation must become an obligation.
- Provide `Audited`, `Checked`, and `Reference` verification modes.
- Attach source-level and IR-level spans to lowered plans.
- Make approximate equality domain-specific.

### 7. Numeric Semantics Are Too Float-Centric

Most operator systems assume floating point, integer, or quantized tensor
semantics. Exact rings, finite fields, p-adic fields, interval arithmetic,
non-Archimedean norms, sheaves, and categorical structures are outside the
optimization model.

Tokitai design response:

- Make domains, laws, valuations, topology, and precision models first-class.
- Distinguish mathematical domain from storage dtype.
- Use contracts to guard rewrites that are unsound for approximate arithmetic.
- Let p-adic precision and sheaf locality participate directly in planning.

## Non-Goals For The Initial Version

- Full symbolic algebra.
- A complete theorem prover.
- A complete MLIR or LLVM integration.
- Production GPU kernels for every operator.
- Exhaustive support for all algebraic geometry or category theory constructs.

The first version should establish the right boundaries and make a small number
of operations correct, inspectable, and extensible.

## Architectural Layers

The system is organized into eight layers:

1. Mathematical domain layer
2. Object and representation layer
3. Contract and evidence layer
4. Operator semantics layer
5. IR and rewrite layer
6. Planning and adaptation layer
7. Backend execution layer
8. Verification and property testing layer

Each layer owns one kind of decision. Cross-layer shortcuts should be avoided
unless they are explicitly modeled as optimizations.

For visual overviews of the eight layers, the planner pipeline, the
proof-replay trace flow, and the theory-evidence chain, see
[`docs/dev/architecture_diagrams.md`](docs/dev/architecture_diagrams.md).

## 1. Mathematical Domain Layer

The domain layer defines the algebraic and analytic setting in which values
live. This layer should not know about tensors, memory layout, or devices.

Core concepts:

- `Domain`: a named mathematical universe for values.
- `AlgebraicStructure`: field, ring, semiring, module, vector space, lattice,
  group, monoid, category, site.
- `Scalar`: an element of a domain.
- `Valuation`: a map such as a p-adic valuation.
- `Norm`: Archimedean or non-Archimedean norm.
- `PrecisionModel`: exact, fixed precision, adaptive precision, interval, ball,
  lazy, symbolic.
- `Topology`: discrete, metric, ultrametric, Zariski-like, site-induced.

Example domains:

- `RealField<f32>`
- `ComplexField<f64>`
- `FiniteField<P>`
- `IntegerRing`
- `RationalField`
- `PadicField<P, Precision>`
- `TropicalSemiring`
- `BooleanAlgebra`

The important design choice is that `dtype` is not enough. A value type needs
both a storage representation and a mathematical domain. For example, a p-adic
number may be stored using machine integers, big integers, limbs, or lazy
expansions, but those storage choices do not define the p-adic field itself.

Suggested Rust shape:

```rust
pub trait Domain {
    type Element;

    fn name(&self) -> &'static str;
    fn structure(&self) -> AlgebraicStructure;
    fn precision_model(&self) -> PrecisionModel;
}

pub trait ValuedDomain: Domain {
    type ValueGroup;

    fn valuation(&self, x: &Self::Element) -> Self::ValueGroup;
}

pub trait NonArchimedeanDomain: ValuedDomain {
    fn satisfies_ultrametric(&self) -> bool {
        true
    }
}
```

The trait design should remain conservative. Rust traits should encode stable
capabilities, not every theorem we might want to express.

## 2. Object And Representation Layer

This layer defines the objects manipulated by operators.

Conventional tensor libraries usually expose only tensors. Tokitai should expose
a broader set of typed objects:

- `Tensor`: dense or sparse indexed data over a domain.
- `Section`: local data over an open set or object in a site.
- `Sheaf`: assignment of sections with restriction and gluing structure.
- `Complex`: chain or cochain complex with differentials.
- `Morphism`: structure-preserving map between objects.
- `BundleLike`: family of fibers over a base.
- `GraphObject`: graph, hypergraph, or incidence structure.

Object metadata should describe both mathematical meaning and representation:

```rust
pub struct ObjectMeta {
    pub object_kind: ObjectKind,
    pub domain: DomainId,
    pub shape: Shape,
    pub representation: Representation,
    pub invariants: Vec<Invariant>,
}
```

For tensors, `shape`, `stride`, `layout`, and `device` remain critical. For
sheaves and p-adic data, additional metadata is needed:

- Base space or site identifier.
- Cover information.
- Restriction maps.
- Gluing constraints.
- Precision and truncation level.
- Valuation bounds.
- Error balls or intervals.

The representation layer should permit multiple physical encodings for the same
object:

- Dense tensor.
- Sparse tensor.
- Blocked tensor.
- Lazy expression.
- Cover-indexed section table.
- p-adic limb array.
- p-adic ball representation.
- Symbolic placeholder.

This makes it possible to optimize storage and execution without changing
operator semantics.

## 3. Contract And Evidence Layer

The contract layer is what makes Tokitai more than a generic abstraction stack.
It defines what the system is allowed to assume, transform, approximate, and
verify.

Every domain, object, operator, rewrite, and backend lowering should be able to
attach contracts. A contract is a machine-readable claim with a scope and an
evidence level.

Examples of contracts:

- Algebraic: associative, commutative, distributive, has identity, has inverse.
- Analytic: Lipschitz bound, non-Archimedean norm, convergence radius.
- Precision: loses at most `k` p-adic digits, preserves ball containment, exact.
- Topological: local, cover-preserving, gluable, restriction-compatible.
- Categorical: functorial, natural, adjoint-like, monoidal.
- Backend: bit-exact, deterministic, tolerance-bounded, reduction-order
  dependent.

Evidence levels:

- `Axiom`: trusted declaration from a domain implementation.
- `Derived`: inferred from other contracts.
- `Checked`: validated by runtime checks for a particular object instance.
- `Tested`: supported by property tests but not proven.
- `ExternalProof`: linked to proof artifacts or certified kernels.
- `UnsafeAssumption`: allowed only behind explicit opt-in.

Suggested shape:

```rust
pub struct Contract {
    pub claim: Claim,
    pub scope: Scope,
    pub evidence: Evidence,
    pub conditions: Vec<Condition>,
}

pub struct RewriteJustification {
    pub source: ContractId,
    pub rule: RewriteRuleId,
    pub discharged_conditions: Vec<ConditionId>,
}
```

This layer gives the optimizer a hard rule: no semantic rewrite without a
justification. For example, reassociating a reduction over `f32` requires a
contract permitting approximate reassociation, while reassociating over a finite
field can be justified by exact associativity.

For p-adic arithmetic, contracts should express valuation and precision facts:

- `v_p(x * y) = v_p(x) + v_p(y)` for nonzero values.
- `v_p(x + y) >= min(v_p(x), v_p(y))`.
- Operation result is valid modulo `p^N`.
- A truncation preserves the requested output precision.

For sheaf-like computation, contracts should express locality:

- A local operator commutes with restriction.
- Compatible sections remain compatible after applying a local operator.
- A gluing operation is valid for a specific cover and sheaf condition.

## 4. Operator Semantics Layer

Operators define valid transformations between objects. An operator should
answer:

- What input object kinds are accepted?
- What domain capabilities are required?
- What output objects are produced?
- What invariants are preserved or introduced?
- What precision behavior is expected?
- Is the operation local, global, linear, multilinear, functorial, exact, or
  approximate?

An operator must not directly encode a kernel implementation.

Examples:

- `Add`, `Mul`, `Matmul`, `Reduce`, `Map`, `Zip`.
- `Restriction`, `Gluing`, `CechCoboundary`, `Pushforward`, `Pullback`.
- `PadicAdd`, `PadicMul`, `PadicExp`, `PadicLog`, `HenselLift`.
- `Valuation`, `Norm`, `Completion`, `Residue`.
- `Differential`, `Boundary`, `CohomologyApprox`.

Suggested shape:

```rust
pub trait Operator {
    fn name(&self) -> &'static str;
    fn signature(&self) -> OpSignature;
    fn infer(&self, inputs: &[ObjectMeta]) -> Result<Vec<ObjectMeta>>;
    fn required_contracts(&self) -> ContractSet;
    fn provided_contracts(&self) -> ContractSet;
}
```

Contracts are important for optimization and testing. For ordinary arithmetic
this includes associativity, distributivity, commutativity, identity, and
annihilator rules where valid. For p-adic domains, contracts may include
valuation identities and ultrametric inequalities. For sheaves, contracts
include restriction composition and gluing uniqueness where applicable.

Operators should declare lawfulness conditionally. For example, floating point
addition is not strictly associative, while exact finite-field addition is.

Operators should also declare their locality and layer behavior:

- `Pointwise`: independent across indices or opens.
- `NeighborhoodLocal`: depends on bounded local overlap.
- `CoverLocal`: can be computed per cover element then glued.
- `Global`: requires whole-object information.
- `PrecisionLayered`: can be refined by increasing precision.
- `ValuationFiltered`: can prune or reorder work using valuation bounds.

This classification feeds directly into planning.

## 5. IR And Rewrite Layer

Tokitai should use several IR levels, each preserving enough semantic metadata
to justify later transformations:

1. `SemanticIR`: graph of object-level operators with domain-aware typing.
2. `ContractIR`: attached claims, obligations, and rewrite justifications.
3. `LayerIR`: local-to-global decomposition over covers, strata, precision
   layers, tensor tiles, and hardware hierarchy.
4. `ScheduleIR`: lower-level representation of loops, memory movement,
   precision operations, and dispatch decisions.

Optional later layers:

- `TensorIR`: loop nests and index expressions for tensor-like computation.
- `SheafIR`: cover-wise local operations plus restriction/gluing steps.
- `PadicIR`: precision-aware arithmetic and valuation control.

Semantic rewrites must be law-aware. The optimizer should not blindly reorder
floating point reductions or truncate p-adic expansions unless the domain and
precision model allow it.

Layer rewrites are a core differentiator. A computation may be decomposed along
several compatible layer structures:

- Tensor layers: tiles, blocks, warps, vector lanes, cache levels.
- Sheaf layers: opens, covers, overlaps, strata, glued global sections.
- p-adic layers: residues modulo `p`, lifts modulo `p^N`, valuation strata,
  precision balls.
- Algorithmic layers: graph-level fusion, operator-level decomposition,
  kernel-level scheduling.

The planner should be able to align these layers. For example, a p-adic tensor
operation over sections can be planned as:

1. split by cover element;
2. split each section tensor by memory tile;
3. compute modulo lower precision first;
4. lift only tiles whose valuation bounds affect the requested result;
5. glue compatible local results.

This is the architectural reason to treat p-adic precision and sheaf locality as
first-class scheduling axes.

Examples of valid rewrites:

- Fuse elementwise maps over exact rings.
- Reassociate reductions only when the domain declares associativity and the
  precision model permits it.
- Push restriction through a local operator.
- Replace global sheaf computation with cover-wise local computation followed by
  gluing.
- Use valuation bounds to skip terms whose contribution is below precision.
- Choose Hensel lifting for appropriate p-adic root-finding problems.

The IR should make precision-changing operations explicit. Hidden truncation is
dangerous in p-adic and interval-like domains.

Every rewrite should produce a `RewriteJustification`. This makes optimized IR
auditable and creates a path toward proof-carrying optimization later.

## 6. Planning And Adaptation Layer

The planner maps a valid semantic program to an executable plan for a target.

Inputs:

- Operator graph.
- Object metadata.
- Domain capabilities.
- Active contracts and unresolved obligations.
- Hardware capabilities.
- Precision requirements.
- Runtime shape and valuation information.

Outputs:

- Execution plan.
- Chosen representations.
- Chosen kernels.
- Precision schedule.
- Local-to-global decomposition.
- Required checks.
- Rewrite evidence log.
- Memory plan.
- Fallback paths.

Adaptation should happen at multiple levels:

- Shape adaptation: choose kernels by shape buckets.
- Layout adaptation: choose dense, sparse, blocked, or cover-indexed layouts.
- Precision adaptation: increase p-adic precision or interval bounds when
  required.
- Algebraic adaptation: choose algorithms based on field/ring/domain laws.
- Locality adaptation: choose covers, overlap checks, gluing order, and strata.
- Evidence adaptation: choose faster plans only when obligations can be
  discharged.
- Hardware adaptation: choose SIMD, threading, CUDA, ROCm, Metal, or WebGPU.

For ordinary tensor workloads, this resembles TVM/Ansor/Triton-style scheduling
and autotuning. For p-adic and sheaf workloads, adaptation also includes
mathematical strategy selection.

Example:

- A p-adic matrix solve may choose between direct elimination, modular reduction
  plus lifting, or iterative refinement depending on precision and valuation.
- A sheaf operation may choose whether to compute on all opens, a basis of the
  topology, or a selected cover.
- A tensor contraction may choose between naive loops, tiled SIMD, packed
  matrix multiplication, or backend-specific generated code.

Planning should optimize a multi-objective cost model:

- Runtime.
- Memory traffic.
- Allocation pressure.
- Precision loss.
- Number of unresolved runtime checks.
- Communication across cover overlaps or device boundaries.
- Reuse of tuned kernels and cached proof obligations.

This means a slower kernel may be preferred when it preserves more p-adic
precision, avoids expensive gluing checks, or keeps a rewrite in the exact
fragment.

## 7. Backend Execution Layer

Backends execute lowered plans. They should not decide mathematical validity.

Initial backend targets:

- `CpuScalar`: correctness-first fallback.
- `CpuSimd`: vectorized execution for ordinary numeric domains.
- `CpuThreaded`: parallel execution.

Later targets:

- `Cuda`
- `Rocm`
- `Metal`
- `WebGpu`
- `Llvm`
- `Mlir`

Backend traits should expose capabilities rather than hardcoded assumptions:

```rust
pub trait Backend {
    fn name(&self) -> &'static str;
    fn capabilities(&self) -> BackendCapabilities;
    fn compile(&self, plan: &LoweredPlan) -> Result<Executable>;
    fn execute(&self, executable: &Executable, args: &[ObjectRef]) -> Result<()>;
}
```

Not every backend should support every domain. A CUDA backend may support dense
`f32` tensor arithmetic but not arbitrary precision p-adics. A CPU backend may
support p-adic arithmetic but only through scalar or limb-vectorized code.

The runtime must allow graceful fallback and mixed execution.

Backends should also report semantic degradation. For example, a backend may
declare that a reduction is nondeterministic, that an operation uses approximate
transcendentals, or that a p-adic operation is implemented only up to fixed
precision. These reports become obligations for the verification layer or cause
the planner to reject the backend for exact workloads.

## 8. Verification And Property Testing Layer

Because the library targets nontrivial mathematics, verification is part of the
architecture rather than an afterthought. Verification is responsible for
discharging obligations produced by operators, rewrites, planners, and backends.

Verification should include:

- Shape and domain type checking.
- Invariant checking.
- Property-based tests from declared laws.
- Cross-backend consistency checks.
- Precision audits.
- Reference implementations.
- Runtime discharge of conditional contracts.
- Replayable audit trails for rewrite justifications.
- Optional links to external proof artifacts.

Examples:

- `v_p(x * y) = v_p(x) + v_p(y)` when both values are nonzero.
- `|x + y|_p <= max(|x|_p, |y|_p)`.
- Restriction maps compose correctly.
- Compatible local sections glue to a unique global section when the sheaf
  declares the relevant condition.
- Optimized tensor reductions match reference reductions within the declared
  numerical tolerance.

For approximate domains, equality should be domain-specific:

- Exact equality for finite fields.
- Tolerance-based equality for floating point.
- Ball containment or precision-bounded equality for p-adics.
- Isomorphism or natural equivalence for some categorical objects.

Verification modes:

- `Fast`: type, shape, and cheap invariant checks.
- `Audited`: records all rewrites and discharged obligations.
- `Checked`: inserts runtime checks for conditional contracts.
- `Reference`: compares against slow reference implementations.
- `Research`: property testing, randomized differential testing, and proof
  artifact export.

This layer should make unsoundness visible. Unsafe assumptions are allowed only
when explicitly requested by a caller or benchmark harness.

## p-adic Design Notes

p-adic support should be modeled as a valued non-Archimedean domain, not as a
custom integer encoding.

Essential metadata:

- Prime `p`.
- Precision `N`.
- Representation mode: canonical residue, expansion, ball, lazy.
- Valuation bounds.
- Error model.

Core operations:

- Addition, subtraction, multiplication.
- Inversion where valuation permits.
- Division with explicit domain checks.
- Valuation.
- Norm.
- Truncation and precision extension.
- Hensel lifting.
- Polynomial evaluation.
- Matrix operations over p-adic fields.

Important planning rules:

- Track precision loss explicitly.
- Prefer valuation-aware algorithms.
- Avoid converting to floating point except for diagnostics or heuristic cost
  estimates.
- Do not assume total ordering.
- Use ultrametric properties for pruning and error bounds.

Planner-level p-adic strategies:

- Compute coarse residues first, then lift only when required.
- Use valuation bounds to skip products or summands that cannot affect the
  requested precision.
- Choose between exact modular arithmetic, fixed-precision arithmetic, ball
  arithmetic, and lazy expansion based on downstream obligations.
- Select Hensel lifting when derivative and residue conditions discharge the
  relevant contracts.
- Prefer algorithms that expose precision loss structurally instead of hiding it
  inside scalar operations.

p-adic values should be able to carry both data and certificates:

```rust
pub struct PadicMeta {
    pub prime: u64,
    pub precision: Precision,
    pub valuation_bound: Option<ValuationBound>,
    pub representation: PadicRepresentation,
    pub error: PadicErrorModel,
}
```

The first implementation can be small and fixed-precision, but the metadata
should leave room for adaptive precision and ball arithmetic.

## Sheaf And Layer-Theoretic Design Notes

"Layer theory" can be interpreted operationally as computation over structured
layers of local-to-global data. The closest stable mathematical abstraction is
sheaf-like computation over a base space, site, cover, or stratified object.

Tokitai should model this with:

- `BaseSpace` or `Site`.
- `Open` or generalized object of the site.
- `Cover`.
- `Section`.
- `RestrictionMap`.
- `GluingCondition`.
- `Sheaf`.

Minimal operations:

- Restrict a section to a smaller open.
- Check compatibility of local sections on overlaps.
- Glue compatible sections.
- Apply local operators cover-wise.
- Pull back or push forward along a morphism of bases.

The key execution idea is locality. If an operator is declared local, the planner
may compute it independently on cover elements and then glue or validate
compatibility. This is the sheaf-theoretic analogue of tensor fusion and tiling:
the mathematical structure exposes legal decomposition.

Planner-level sheaf strategies:

- Replace a global local operator with cover-wise local operators plus gluing.
- Choose a smaller basis or refinement when it reduces overlap checks.
- Push restrictions through local operators to avoid computing unused global
  data.
- Cache overlap computations as reusable obligations.
- Use Cech-style complexes when global consistency is the computation, not just
  a validation step.

Layer theory should therefore be treated as a first-class decomposition model:
each layer has local state, transition or restriction maps, compatibility rules,
and optional global reconstruction. This model can represent sheaves directly,
but it can also describe neural network layers, numerical multigrid levels,
precision refinements, and hardware memory hierarchy.

Potential future objects:

- Presheaves.
- Sheaves of modules.
- Cech complexes.
- Derived or cohomological approximations.
- Stratified spaces.
- Fiber bundles.
- Layered computational graphs where each layer has local rules and transition
  maps.

## Initial Milestones

### Milestone 0: Contracted Design Skeleton

- Create module boundaries.
- Define core metadata structs.
- Define conservative traits for domains, objects, contracts, operators, and
  backends.
- Add evidence and obligation data structures.
- Add this architecture document.

### Milestone 1: Minimal Semantic Runtime

- Implement dense tensor metadata as the first object representation.
- Implement finite-field or integer-like exact arithmetic plus `f32` as a
  contrasting approximate domain.
- Implement `Add`, `Mul`, `Map`, and `Reduce`.
- Add shape inference, domain checks, and contract propagation.

### Milestone 2: Evidence-Carrying Rewrites

- Add law and contract declarations.
- Add rewrite justifications and obligation tracking.
- Distinguish exact, floating, and approximate equality.
- Add rewrite guards based on lawfulness.
- Add property-based testing hooks generated from contracts.

### Milestone 3: LayerIR Prototype

- Represent tensor tiling, precision layers, and cover layers using one IR
  vocabulary.
- Add local/global operator classification.
- Add simple local-to-global planning.
- Emit auditable decomposition plans.

### Milestone 4: p-adic Prototype

- Implement a small fixed-precision p-adic domain.
- Support valuation, addition, multiplication, inversion, and truncation.
- Add p-adic property tests.
- Add simple p-adic vector and matrix operations.
- Add valuation-aware planning for at least one reduction-like operation.

### Milestone 5: Sheaf Prototype

- Implement a finite site or finite topological basis.
- Implement sections, restrictions, covers, and compatibility checks.
- Implement local operator application and gluing.
- Add Cech-style construction only after the basic model is stable.

### Milestone 6: Adaptive Planning

- Add execution plans.
- Add CPU backend selection.
- Add representation selection.
- Add shape and precision aware dispatch cache.
- Add cost model terms for precision loss, obligations, and locality.

### Milestone 7: Performance Backends

- Add SIMD and threading.
- Add kernel registry.
- Add benchmarking and tuning records.
- Explore GPU or MLIR integration.

## Design Constraints

- Keep mathematical semantics independent from storage representation.
- Make precision changes explicit.
- Avoid assuming floating point algebraic laws.
- Avoid assuming total ordering or Archimedean norms.
- Treat tensor computation as a special case of typed object computation.
- Prefer small, inspectable core abstractions over large generic hierarchies.
- Let optimization be justified by declared laws and checked invariants.
- Never perform a semantic rewrite without evidence.
- Treat precision, locality, and representation as planner resources.
- Make degraded backend semantics explicit.

## Recommended First Code Structure

```text
src/
  lib.rs
  domain/
    mod.rs
    structure.rs
    contract.rs
    valued.rs
    padic.rs
  object/
    mod.rs
    tensor.rs
    sheaf.rs
    meta.rs
  op/
    mod.rs
    signature.rs
    contract.rs
    arithmetic.rs
  ir/
    mod.rs
    semantic.rs
    contract.rs
    layer.rs
    schedule.rs
  planner/
    mod.rs
    obligation.rs
    plan.rs
  backend/
    mod.rs
    cpu.rs
  verify/
    mod.rs
    equality.rs
    properties.rs
```

The crate can still expose a simple high-level API later. The internal structure
should remain domain-first so that uncommon mathematical objects do not become
second-class extensions.

## 13. Post-P320 Refactor Backlog (Code Review Findings)

The post-P320 code review produced 13 concrete improvement phases
(P321-P333) covering refactors, API ergonomics, and documentation. The
phases are tracked in `todo.json` under `improvement_phases`.

### Tier A — Critical Refactors (5 phases)

- **P321**: Split `src/verify/mod.rs` (7,685 lines, 125 `pub use` re-exports)
  into focused submodules (`proof_replay`, `audit`, `reports`).
- **P322**: Replace remaining `unwrap()` in `src/checkpoint.rs` TKP1 binary
  parser with statically-safe inline 4-byte array construction.
- **P323**: Add `Shape::first_dim()`, `Shape::last_dim()`, `Shape::dim(index)`
  accessors so tensor code does not reach into `shape.dims[i]` directly.
- **P324**: Add `Padic::precision_consistent_with` and unify the
  `digits_base_p` / `digit_sum` / `lowest_digit` / `first_nonzero_digit`
  helpers via a private `digit_at(index)` helper.
- **P325**: Add `Tensor::try_from_vec` (Result-returning constructor that
  validates data length matches the product of static shape dimensions).

### Tier B — API Polish (5 phases)

- **P326**: Replace 125 individual `pub use` re-exports in
  `src/verify/mod.rs` with curated `pub mod` re-export blocks.
- **P327**: Add `CostModelWeights::balanced()` (default), `strict()` (3x
  base, 5x penalties), and `permissive()` (half base, 2x credits) constructors.
- **P328**: Add `Dim::Static.value()` and `value_or(default)` accessors
  so callers stop pattern-matching on the variant. Used by P323's refactor
  of `Tensor::flatten_first_axis`.
- **P329**: Add `Shape::ensure_rank` and `SectionTable::keys` / `values` /
  `iter` iterators.
- **P330**: Add `Tensor::truncate(n)` (drain prefix, return Vec) and
  `Tensor::into_inner` (consuming destructor) for MoE padding cleanup
  and meta/data hand-off.

### Tier C — Documentation (3 phases)

- **P331**: Add `examples/end_to_end_workflow.rs` covering
  `Tensor → plan → execute → verify` end-to-end (currently runs and prints
  `out.data = [4, 10, 18]` for `lhs=[1,2,3] * rhs=[4,5,6]`).
- **P332**: Add 3 `///` doc tests for top-level `Tokitai` facade methods
  (`cpu_only`, `padic_domain`, `execute_i64`).
- **P333**: This section — update `ARCHITECTURE.md` to reflect the
  post-P320 state.

### Test Baseline

The post-P320 baseline is **425 tests passing** across 76 test files
(393 passing in the pre-P321 baseline + 32 new tests added in P322-P330).
The full baseline runs cleanly with `cargo test --offline` and `cargo fmt --check`
produces no diffs.

### New Helpers in the post-P320 Surface

The following helpers were added in P306-P320 and are exercised by the
425-test baseline:

- **Tensor** (`src/object/tensor.rs`): `first`, `last`,
  `flatten_first_axis`, `element_equals`, `sum`, `reversed`, `min`, `max`,
  `truncate`, `into_inner`, `try_from_vec`.
- **Sheaf** (`src/object/sheaf.rs`): `SectionTable::len`, `is_empty`,
  `contains`, `keys`, `values`, `iter`; `FiniteSite::contains_all_opens`,
  `open_count`, `has_open`.
- **Padic** (`src/domain/padic.rs`): `Padic::digits_base_p` (lowest-digit-first base-p expansion), `Padic::lowest_digit`,
  `digit_sum`, `first_nonzero_digit`, `precision_consistent_with`,
  `same_prime_as`.
- **Shape** (`src/object/shape.rs`): `first_dim`, `last_dim`, `dim`,
  `ensure_rank`, plus `Dim::Static.value()` and `value_or()`.
- **CostModelWeights** (`src/planner/cost_model.rs`): `balanced()` (now
  the default), `strict()`, `permissive()`.

## 14. Post-P334 Functional Expansion (P335-P354)

The P335-P354 phases extend the operator surface, public documentation,
and developer ergonomics without touching the claim-boundary code that
the paper route depends on (audit, support matrix, release gate, theorem
binding, claim status, semantic conformance, HIP audit). Specifically:

- **P335-P339 (operator surface)**: 8 new arithmetic ops
  (SubOp, DivOp, ScalarAddOp, ScalarMulOp, PowOp, SqrtOp, Exp2Op, Log2Op),
  9 shape ops (ReshapeOp, TransposeOp, PermuteOp, SliceOp, ConcatOp,
  BroadcastOp, FlattenOp, SqueezeOp, UnsqueezeOp), 6 NN/activation ops
  (ReluOp, SigmoidOp, TanhOp, GELUOp, SoftmaxOp, LayerNormOp), 5 index
  ops (GatherOp, ScatterOp, IndexSelectOp, IndexAddOp, NonzeroOp), and
  9 reduction ops (SumOp, MeanOp, MaxOp, MinOp, ArgMaxOp, ArgMinOp,
  ProdOp, AnyOp, AllOp). All follow the same `op struct + OpSignature +
  infer + contracts + CPU helper + facade builder + integration test`
  pattern. The CPU backend is the correctness baseline; no GPU
  acceleration claim is associated with these ops.
- **P340 (doctests)**: 82 new `///` examples on the public `Tokitai`,
  `RuntimeOptions`, and `FacadeGraphBuilder` facade methods, bringing the
  doctest count to 85+ and making the rustdoc API self-explanatory.
- **P341 (book)**: 4 progressive tutorial chapters in `examples/book/`
  covering hello-tensor, shapes-and-reshape, p-adic-valuation-skip, and
  finite-sheaf-glue.
- **P342 (broken test fixes)**: 3 test files that depended on
  rocm-hip-only paths were re-gated or removed; the standard
  `cargo test --offline` now compiles and runs every test file.
- **P343 (graph DSL)**: a new closure-based `crate::ir::dsl::GraphBuilder`
  with `TensorHandle` and `CompiledGraph` lets users build a graph in
  a single expression. `Tokitai::graph(closure)` exposes the DSL through
  the public facade.
- **P344 (verify refactor)**: deferred to keep `src/verify/mod.rs`
  claim-boundary; tracked in `todo.json` for a future pass.
- **P349 (unwrap audit)**: 4 remaining public-path `unwrap()`s replaced
  with `expect()` / `map_or` / direct-indexing accessors.
- **P350 (integration test)**: `tests/mlp_forward_e2e.rs` exercises a
  full `Linear -> LayerNorm -> GELU -> Linear -> Softmax` chain through
  both the imperative and the DSL facade.
- **P352 (proptest)**: 5 new proptest files covering the new ops;
  `proptest = "1.5"` is now a dev-dependency.
- **P353 (architecture diagrams)**: 4 Mermaid diagrams in
  `docs/dev/architecture_diagrams.md` (8-layer, planner pipeline,
  proof-replay trace, theory-evidence chain).
- **P354 (this section)**: re-syncs the docs to reflect the new surface.

### Diagrams

See `docs/dev/architecture_diagrams.md` for visual overviews of the
eight architectural layers, the planner pipeline, the proof-replay
trace flow, and the theory-evidence chain.

### Test Baseline After P354

The post-P354 baseline is **611+ tests passing** (up from 425 at the
end of P333). The four pre-existing `tests/paper_artifacts.rs` failures
(claiming a `current_focus` for a P-number that has been re-baselined)
remain in scope of `tests/paper_artifacts.rs` only and are not part of
the canonical baseline.

### Test Baseline After P440

The post-P440 baseline is **693 tests passing** (up from 611+ at the
end of P354). The P381-P420 polish and documentation batches added the
observability test counterparts (`tests/claim_status_report.rs`,
`tests/submission_readiness.rs`) and the P426-P440 code-review improvement
batch added the variant-asserted `tests/error_variants.rs`, the
arithmetic `MatmulOp` / `FmaOp` / `MapOp` tests, the proptest `GeluOp` /
`SoftmaxOp` / `LayerNormOp` cases, and the `tests/README.md` per-item
coverage check. The full baseline runs cleanly with `cargo test --offline`
and `cargo fmt --check`.