tokitai-operator 0.1.0

# Tokitai Operator Architecture

This document sketches the architecture for a Rust operator system whose core
abstractions can serve both conventional tensor workloads and less common
mathematical domains such as sheaves, p-adic fields, non-Archimedean norms, and
structured algebraic spaces.

The intended bar is higher than "a tensor library with extra dtypes". Tokitai
should become a mathematical operator compiler: a system that uses algebraic
laws, locality, valuation, precision, and representation constraints as
optimization inputs. Tensor scheduling, p-adic precision control, and
sheaf-theoretic local-to-global decomposition should be expressed through the
same planning machinery.

The guiding principle is separation of concerns:

- Mathematical meaning is defined independently of storage, scheduling, and
  hardware execution.
- Numeric domains are first-class objects, not extensions of floating point
  dtypes.
- Operators describe valid transformations between typed mathematical objects.
- Rewrites carry evidence explaining why they are valid.
- Precision, locality, and valuation are schedulable resources.
- Backends decide how a valid transformation is executed on a concrete device.

Post-submission SIMD/GPU backend exploration is scoped in
`docs/paper/post_submission_accelerators.md`. That plan keeps CPU reference execution as the current correctness baseline
and treats accelerated backends as future work behind explicit registry/lowering proof obligations.

## Goals

Tokitai should eventually support two classes of workloads:

1. High-performance tensor computation on ordinary numeric domains such as
   `f32`, `f64`, `bf16`, `fp8`, integer, boolean, and quantized formats.
2. Mathematically structured computation over domains such as finite fields,
   rings, p-adic fields, valued fields, sheaves over sites, chain complexes, and
   locally defined data with gluing constraints.

The architecture must avoid a common failure mode: designing a fast tensor
library first, then trying to attach richer mathematics through ad hoc metadata.
Instead, tensors should be one representation of a broader object system.

The "next-generation" claim depends on five concrete capabilities:

1. Contract-carrying operators: every operator exposes semantic requirements,
   invariants, laws, and precision behavior.
2. Evidence-carrying rewrites: every optimization records the law, theorem,
   approximation rule, or backend equivalence that justifies it.
3. Layered local-to-global execution: computation can be split across covers,
   strata, precision layers, memory hierarchy, and hardware hierarchy using one
   planner vocabulary.
4. Adaptive mathematical scheduling: the planner can choose not only tile sizes
   and kernels, but also algebraic algorithms, precision schedules, covers,
   lifting strategies, and representations.
5. Domain-aware verification: optimized plans are checked against declared laws,
   reference semantics, and domain-specific equality.

This positions Tokitai closer to a proof-aware mathematical runtime than a
conventional eager tensor package.

## Research Positioning

Tokitai should learn from existing systems but not copy their boundaries:

- TVM, TensorIR, Ansor, Triton, and related systems show how tensor programs can
  be represented, scheduled, searched, and lowered efficiently.
- MLIR and IREE show how layered IRs and target-specific lowering can scale
  across backends.
- FlashAttention shows that new operators emerge when the memory model is part
  of the algorithm, not merely an implementation detail.
- Superoptimization work such as Mirage points toward multi-level search across
  graph, operator, and kernel representations.
- Computer algebra systems and proof assistants show the importance of laws,
  exactness, and constructive evidence, but they usually do not target adaptive
  high-performance execution.

Tokitai's intended gap is the intersection: high-performance operator planning
where algebraic structure, topology, valuation, and precision are explicit
optimization dimensions.

## Pain Points In Current Operator Systems

The strongest current systems provide useful lessons, but they expose recurring
pain points that Tokitai should avoid by design.

### 1. Graph Capture Is Fragile

PyTorch 2.x improves eager programs through graph capture and compilation, but
graph breaks remain a major performance and usability issue. Data-dependent
Python control flow, unsupported Python functions, C extensions, and custom
logic can fragment a program into multiple compiled graphs. Once the graph is
fragmented, fusion and global optimization opportunities are lost.

Tokitai design response:

- Use an explicit semantic builder for core programs instead of relying only on
  tracing host-language behavior.
- Treat non-capturable boundaries as typed effects, not silent graph breaks.
- Make partial compilation explicit in the IR and expose optimization loss.
- Provide escape hatches through custom operators with declared contracts.

### 2. Dynamic Shape Support Is Still A Core Weakness

Many compiler stacks still specialize heavily on static shapes. Dynamic shapes
cause recompilation, reduce optimization quality, or require padding and
bucketization. Bounded dynamic shapes help, but they do not solve all cases,
especially data-dependent output shapes.

Tokitai design response:

- Represent symbolic, bounded, and data-dependent shapes in `ObjectMeta`.
- Make shape constraints first-class planner facts.
- Cache plans by shape classes and contracts rather than only concrete shapes.
- Separate semantic validity from specialization quality.
- Add explicit fallback paths for data-dependent shape operators.

### 3. JIT And Autotuning Costs Are Operationally Painful

Triton, Inductor, XLA, TVM MetaSchedule, and similar systems can produce strong
kernels, but compilation and tuning time can dominate production workflows.
Search-based tuning trades time for performance, and tuning records are often
shape-, fusion-, and backend-specific.

Tokitai design response:

- Support a tiered planner: heuristic baseline first, measured tuning later.
- Persist tuning records and rewrite evidence separately.
- Use transferable plan keys based on semantic structure, shape class, domain,
  layout, and backend capabilities.
- Make tuning budget explicit in the API.
- Prefer plans that are robust across shape buckets when latency matters.

### 4. Fusion Is Powerful But Pattern-Limited

Libraries such as oneDNN Graph, cuDNN Graph, CUTLASS epilogues, and compiler
fusers can remove memory traffic, but their fusion power is often constrained by
recognized patterns, layout assumptions, hardware generation, or epilogue
interfaces.

Tokitai design response:

- Model fusion as a contract-checked rewrite instead of a fixed pattern list.
- Represent producer-consumer relations, aliasing, layout, and precision
  obligations explicitly.
- Allow backend-specific fusion without losing the semantic audit trail.
- Support both graph-level fusion and local-to-global decomposition.

### 5. Performance Portability Remains Incomplete

High-performance kernels are usually tied to a narrow hardware family, memory
hierarchy, data type set, or layout. CUTLASS is excellent for NVIDIA GEMM-like
workloads, oneDNN is strong on CPU, XLA targets TPU well, and Triton is strong
for custom GPU kernels, but each stack has different limits and tuning
assumptions.

Tokitai design response:

- Keep backend capabilities explicit and queryable.
- Do not let a backend define operator semantics.
- Allow mixed execution and graceful fallback.
- Treat hardware hierarchy as one kind of layer in `LayerIR`.
- Keep representation selection independent from mathematical domain.

### 6. Debuggability And Correctness Are Weak Spots

Fused generated kernels are hard to inspect. Incorrect compiler output,
numerical drift, nondeterministic reductions, layout mistakes, and backend
fallbacks are often difficult to diagnose. Most systems optimize first and make
the user reconstruct why a transformation happened.

Tokitai design response:

- Every semantic rewrite must carry evidence.
- Every backend semantic degradation must become an obligation.
- Provide `Audited`, `Checked`, and `Reference` verification modes.
- Attach source-level and IR-level spans to lowered plans.
- Make approximate equality domain-specific.

### 7. Numeric Semantics Are Too Float-Centric

Most operator systems assume floating point, integer, or quantized tensor
semantics. Exact rings, finite fields, p-adic fields, interval arithmetic,
non-Archimedean norms, sheaves, and categorical structures are outside the
optimization model.

Tokitai design response:

- Make domains, laws, valuations, topology, and precision models first-class.
- Distinguish mathematical domain from storage dtype.
- Use contracts to guard rewrites that are unsound for approximate arithmetic.
- Let p-adic precision and sheaf locality participate directly in planning.

## Non-Goals For The Initial Version

- Full symbolic algebra.
- A complete theorem prover.
- A complete MLIR or LLVM integration.
- Production GPU kernels for every operator.
- Exhaustive support for all algebraic geometry or category theory constructs.

The first version should establish the right boundaries and make a small number
of operations correct, inspectable, and extensible.

## Architectural Layers

The system is organized into eight layers:

1. Mathematical domain layer
2. Object and representation layer
3. Contract and evidence layer
4. Operator semantics layer
5. IR and rewrite layer
6. Planning and adaptation layer
7. Backend execution layer
8. Verification and property testing layer

Each layer owns one kind of decision. Cross-layer shortcuts should be avoided
unless they are explicitly modeled as optimizations.

For visual overviews of the eight layers, the planner pipeline, the
proof-replay trace flow, and the theory-evidence chain, see
[`docs/dev/architecture_diagrams.md`](docs/dev/architecture_diagrams.md).

## 1. Mathematical Domain Layer

The domain layer defines the algebraic and analytic setting in which values
live. This layer should not know about tensors, memory layout, or devices.

Core concepts:

- `Domain`: a named mathematical universe for values.
- `AlgebraicStructure`: field, ring, semiring, module, vector space, lattice,
  group, monoid, category, site.
- `Scalar`: an element of a domain.
- `Valuation`: a map such as a p-adic valuation.
- `Norm`: Archimedean or non-Archimedean norm.
- `PrecisionModel`: exact, fixed precision, adaptive precision, interval, ball,
  lazy, symbolic.
- `Topology`: discrete, metric, ultrametric, Zariski-like, site-induced.

Example domains:

- `RealField<f32>`
- `ComplexField<f64>`
- `FiniteField<P>`
- `IntegerRing`
- `RationalField`
- `PadicField<P, Precision>`
- `TropicalSemiring`
- `BooleanAlgebra`

The important design choice is that `dtype` is not enough. A value type needs
both a storage representation and a mathematical domain. For example, a p-adic
number may be stored using machine integers, big integers, limbs, or lazy
expansions, but those storage choices do not define the p-adic field itself.

Suggested Rust shape:

```rust
pub trait Domain {
    type Element;

    fn name(&self) -> &'static str;
    fn structure(&self) -> AlgebraicStructure;
    fn precision_model(&self) -> PrecisionModel;
}

pub trait ValuedDomain: Domain {
    type ValueGroup;

    fn valuation(&self, x: &Self::Element) -> Self::ValueGroup;
}

pub trait NonArchimedeanDomain: ValuedDomain {
    fn satisfies_ultrametric(&self) -> bool {
        true
    }
}
```

The trait design should remain conservative. Rust traits should encode stable
capabilities, not every theorem we might want to express.

## 2. Object And Representation Layer

This layer defines the objects manipulated by operators.

Conventional tensor libraries usually expose only tensors. Tokitai should expose
a broader set of typed objects:

- `Tensor`: dense or sparse indexed data over a domain.
- `Section`: local data over an open set or object in a site.
- `Sheaf`: assignment of sections with restriction and gluing structure.
- `Complex`: chain or cochain complex with differentials.
- `Morphism`: structure-preserving map between objects.
- `BundleLike`: family of fibers over a base.
- `GraphObject`: graph, hypergraph, or incidence structure.

Object metadata should describe both mathematical meaning and representation:

```rust
pub struct ObjectMeta {
    pub object_kind: ObjectKind,
    pub domain: DomainId,
    pub shape: Shape,
    pub representation: Representation,
    pub invariants: Vec<Invariant>,
}
```

For tensors, `shape`, `stride`, `layout`, and `device` remain critical. For
sheaves and p-adic data, additional metadata is needed:

- Base space or site identifier.
- Cover information.
- Restriction maps.
- Gluing constraints.
- Precision and truncation level.
- Valuation bounds.
- Error balls or intervals.

The representation layer should permit multiple physical encodings for the same
object:

- Dense tensor.
- Sparse tensor.
- Blocked tensor.
- Lazy expression.
- Cover-indexed section table.
- p-adic limb array.
- p-adic ball representation.
- Symbolic placeholder.

This makes it possible to optimize storage and execution without changing
operator semantics.

## 3. Contract And Evidence Layer

The contract layer is what makes Tokitai more than a generic abstraction stack.
It defines what the system is allowed to assume, transform, approximate, and
verify.

Every domain, object, operator, rewrite, and backend lowering should be able to
attach contracts. A contract is a machine-readable claim with a scope and an
evidence level.

Examples of contracts:

- Algebraic: associative, commutative, distributive, has identity, has inverse.
- Analytic: Lipschitz bound, non-Archimedean norm, convergence radius.
- Precision: loses at most `k` p-adic digits, preserves ball containment, exact.
- Topological: local, cover-preserving, gluable, restriction-compatible.
- Categorical: functorial, natural, adjoint-like, monoidal.
- Backend: bit-exact, deterministic, tolerance-bounded, reduction-order
  dependent.

Evidence levels:

- `Axiom`: trusted declaration from a domain implementation.
- `Derived`: inferred from other contracts.
- `Checked`: validated by runtime checks for a particular object instance.
- `Tested`: supported by property tests but not proven.
- `ExternalProof`: linked to proof artifacts or certified kernels.
- `UnsafeAssumption`: allowed only behind explicit opt-in.

Suggested shape:

```rust
pub struct Contract {
    pub claim: Claim,
    pub scope: Scope,
    pub evidence: Evidence,
    pub conditions: Vec<Condition>,
}

pub struct RewriteJustification {
    pub source: ContractId,
    pub rule: RewriteRuleId,
    pub discharged_conditions: Vec<ConditionId>,
}
```

This layer gives the optimizer a hard rule: no semantic rewrite without a
justification. For example, reassociating a reduction over `f32` requires a
contract permitting approximate reassociation, while reassociating over a finite
field can be justified by exact associativity.

For p-adic arithmetic, contracts should express valuation and precision facts:

- `v_p(x * y) = v_p(x) + v_p(y)` for nonzero values.
- `v_p(x + y) >= min(v_p(x), v_p(y))`.
- Operation result is valid modulo `p^N`.
- A truncation preserves the requested output precision.

For sheaf-like computation, contracts should express locality:

- A local operator commutes with restriction.
- Compatible sections remain compatible after applying a local operator.
- A gluing operation is valid for a specific cover and sheaf condition.

## 4. Operator Semantics Layer

Operators define valid transformations between objects. An operator should
answer:

- What input object kinds are accepted?
- What domain capabilities are required?
- What output objects are produced?
- What invariants are preserved or introduced?
- What precision behavior is expected?
- Is the operation local, global, linear, multilinear, functorial, exact, or
  approximate?

An operator must not directly encode a kernel implementation.

Examples:

- `Add`, `Mul`, `Matmul`, `Reduce`, `Map`, `Zip`.
- `Restriction`, `Gluing`, `CechCoboundary`, `Pushforward`, `Pullback`.
- `PadicAdd`, `PadicMul`, `PadicExp`, `PadicLog`, `HenselLift`.
- `Valuation`, `Norm`, `Completion`, `Residue`.
- `Differential`, `Boundary`, `CohomologyApprox`.

Suggested shape:

```rust
pub trait Operator {
    fn name(&self) -> &'static str;
    fn signature(&self) -> OpSignature;
    fn infer(&self, inputs: &[ObjectMeta]) -> Result<Vec<ObjectMeta>>;
    fn required_contracts(&self) -> ContractSet;
    fn provided_contracts(&self) -> ContractSet;
}
```

Contracts are important for optimization and testing. For ordinary arithmetic
this includes associativity, distributivity, commutativity, identity, and
annihilator rules where valid. For p-adic domains, contracts may include
valuation identities and ultrametric inequalities. For sheaves, contracts
include restriction composition and gluing uniqueness where applicable.

Operators should declare lawfulness conditionally. For example, floating point
addition is not strictly associative, while exact finite-field addition is.

Operators should also declare their locality and layer behavior:

- `Pointwise`: independent across indices or opens.
- `NeighborhoodLocal`: depends on bounded local overlap.
- `CoverLocal`: can be computed per cover element then glued.
- `Global`: requires whole-object information.
- `PrecisionLayered`: can be refined by increasing precision.
- `ValuationFiltered`: can prune or reorder work using valuation bounds.

This classification feeds directly into planning.

## 5. IR And Rewrite Layer

Tokitai should use several IR levels, each preserving enough semantic metadata
to justify later transformations:

1. `SemanticIR`: graph of object-level operators with domain-aware typing.
2. `ContractIR`: attached claims, obligations, and rewrite justifications.
3. `LayerIR`: local-to-global decomposition over covers, strata, precision
   layers, tensor tiles, and hardware hierarchy.
4. `ScheduleIR`: lower-level representation of loops, memory movement,
   precision operations, and dispatch decisions.

Optional later layers:

- `TensorIR`: loop nests and index expressions for tensor-like computation.
- `SheafIR`: cover-wise local operations plus restriction/gluing steps.
- `PadicIR`: precision-aware arithmetic and valuation control.

Semantic rewrites must be law-aware. The optimizer should not blindly reorder
floating point reductions or truncate p-adic expansions unless the domain and
precision model allow it.

Layer rewrites are a core differentiator. A computation may be decomposed along
several compatible layer structures:

- Tensor layers: tiles, blocks, warps, vector lanes, cache levels.
- Sheaf layers: opens, covers, overlaps, strata, glued global sections.
- p-adic layers: residues modulo `p`, lifts modulo `p^N`, valuation strata,
  precision balls.
- Algorithmic layers: graph-level fusion, operator-level decomposition,
  kernel-level scheduling.

The planner should be able to align these layers. For example, a p-adic tensor
operation over sections can be planned as:

1. split by cover element;
2. split each section tensor by memory tile;
3. compute modulo lower precision first;
4. lift only tiles whose valuation bounds affect the requested result;
5. glue compatible local results.

This is the architectural reason to treat p-adic precision and sheaf locality as
first-class scheduling axes.

Examples of valid rewrites:

- Fuse elementwise maps over exact rings.
- Reassociate reductions only when the domain declares associativity and the
  precision model permits it.
- Push restriction through a local operator.
- Replace global sheaf computation with cover-wise local computation followed by
  gluing.
- Use valuation bounds to skip terms whose contribution is below precision.
- Choose Hensel lifting for appropriate p-adic root-finding problems.

The IR should make precision-changing operations explicit. Hidden truncation is
dangerous in p-adic and interval-like domains.

Every rewrite should produce a `RewriteJustification`. This makes optimized IR
auditable and creates a path toward proof-carrying optimization later.

## 6. Planning And Adaptation Layer

The planner maps a valid semantic program to an executable plan for a target.

Inputs:

- Operator graph.
- Object metadata.
- Domain capabilities.
- Active contracts and unresolved obligations.
- Hardware capabilities.
- Precision requirements.
- Runtime shape and valuation information.

Outputs:

- Execution plan.
- Chosen representations.
- Chosen kernels.
- Precision schedule.
- Local-to-global decomposition.
- Required checks.
- Rewrite evidence log.
- Memory plan.
- Fallback paths.

Adaptation should happen at multiple levels:

- Shape adaptation: choose kernels by shape buckets.
- Layout adaptation: choose dense, sparse, blocked, or cover-indexed layouts.
- Precision adaptation: increase p-adic precision or interval bounds when
  required.
- Algebraic adaptation: choose algorithms based on field/ring/domain laws.
- Locality adaptation: choose covers, overlap checks, gluing order, and strata.
- Evidence adaptation: choose faster plans only when obligations can be
  discharged.
- Hardware adaptation: choose SIMD, threading, CUDA, ROCm, Metal, or WebGPU.

For ordinary tensor workloads, this resembles TVM/Ansor/Triton-style scheduling
and autotuning. For p-adic and sheaf workloads, adaptation also includes
mathematical strategy selection.

Example:

- A p-adic matrix solve may choose between direct elimination, modular reduction
  plus lifting, or iterative refinement depending on precision and valuation.
- A sheaf operation may choose whether to compute on all opens, a basis of the
  topology, or a selected cover.
- A tensor contraction may choose between naive loops, tiled SIMD, packed
  matrix multiplication, or backend-specific generated code.

Planning should optimize a multi-objective cost model:

- Runtime.
- Memory traffic.
- Allocation pressure.
- Precision loss.
- Number of unresolved runtime checks.
- Communication across cover overlaps or device boundaries.
- Reuse of tuned kernels and cached proof obligations.

This means a slower kernel may be preferred when it preserves more p-adic
precision, avoids expensive gluing checks, or keeps a rewrite in the exact
fragment.

## 7. Backend Execution Layer

Backends execute lowered plans. They should not decide mathematical validity.

Initial backend targets:

- `CpuScalar`: correctness-first fallback.
- `CpuSimd`: vectorized execution for ordinary numeric domains.
- `CpuThreaded`: parallel execution.

Later targets:

- `Cuda`
- `Rocm`
- `Metal`
- `WebGpu`
- `Llvm`
- `Mlir`

Backend traits should expose capabilities rather than hardcoded assumptions:

```rust
pub trait Backend {
    fn name(&self) -> &'static str;
    fn capabilities(&self) -> BackendCapabilities;
    fn compile(&self, plan: &LoweredPlan) -> Result<Executable>;
    fn execute(&self, executable: &Executable, args: &[ObjectRef]) -> Result<()>;
}
```

Not every backend should support every domain. A CUDA backend may support dense
`f32` tensor arithmetic but not arbitrary precision p-adics. A CPU backend may
support p-adic arithmetic but only through scalar or limb-vectorized code.

The runtime must allow graceful fallback and mixed execution.

Backends should also report semantic degradation. For example, a backend may
declare that a reduction is nondeterministic, that an operation uses approximate
transcendentals, or that a p-adic operation is implemented only up to fixed
precision. These reports become obligations for the verification layer or cause
the planner to reject the backend for exact workloads.

## 8. Verification And Property Testing Layer

Because the library targets nontrivial mathematics, verification is part of the
architecture rather than an afterthought. Verification is responsible for
discharging obligations produced by operators, rewrites, planners, and backends.

Verification should include:

- Shape and domain type checking.
- Invariant checking.
- Property-based tests from declared laws.
- Cross-backend consistency checks.
- Precision audits.
- Reference implementations.
- Runtime discharge of conditional contracts.
- Replayable audit trails for rewrite justifications.
- Optional links to external proof artifacts.

Examples:

- `v_p(x * y) = v_p(x) + v_p(y)` when both values are nonzero.
- `|x + y|_p <= max(|x|_p, |y|_p)`.
- Restriction maps compose correctly.
- Compatible local sections glue to a unique global section when the sheaf
  declares the relevant condition.
- Optimized tensor reductions match reference reductions within the declared
  numerical tolerance.

For approximate domains, equality should be domain-specific:

- Exact equality for finite fields.
- Tolerance-based equality for floating point.
- Ball containment or precision-bounded equality for p-adics.
- Isomorphism or natural equivalence for some categorical objects.

Verification modes:

- `Fast`: type, shape, and cheap invariant checks.
- `Audited`: records all rewrites and discharged obligations.
- `Checked`: inserts runtime checks for conditional contracts.
- `Reference`: compares against slow reference implementations.
- `Research`: property testing, randomized differential testing, and proof
  artifact export.

This layer should make unsoundness visible. Unsafe assumptions are allowed only
when explicitly requested by a caller or benchmark harness.

## p-adic Design Notes

p-adic support should be modeled as a valued non-Archimedean domain, not as a
custom integer encoding.

Essential metadata:

- Prime `p`.
- Precision `N`.
- Representation mode: canonical residue, expansion, ball, lazy.
- Valuation bounds.
- Error model.

Core operations:

- Addition, subtraction, multiplication.
- Inversion where valuation permits.
- Division with explicit domain checks.
- Valuation.
- Norm.
- Truncation and precision extension.
- Hensel lifting.
- Polynomial evaluation.
- Matrix operations over p-adic fields.

Important planning rules:

- Track precision loss explicitly.
- Prefer valuation-aware algorithms.
- Avoid converting to floating point except for diagnostics or heuristic cost
  estimates.
- Do not assume total ordering.
- Use ultrametric properties for pruning and error bounds.

Planner-level p-adic strategies:

- Compute coarse residues first, then lift only when required.
- Use valuation bounds to skip products or summands that cannot affect the
  requested precision.
- Choose between exact modular arithmetic, fixed-precision arithmetic, ball
  arithmetic, and lazy expansion based on downstream obligations.
- Select Hensel lifting when derivative and residue conditions discharge the
  relevant contracts.
- Prefer algorithms that expose precision loss structurally instead of hiding it
  inside scalar operations.

p-adic values should be able to carry both data and certificates:

```rust
pub struct PadicMeta {
    pub prime: u64,
    pub precision: Precision,
    pub valuation_bound: Option<ValuationBound>,
    pub representation: PadicRepresentation,
    pub error: PadicErrorModel,
}
```

The first implementation can be small and fixed-precision, but the metadata
should leave room for adaptive precision and ball arithmetic.

## Sheaf And Layer-Theoretic Design Notes

"Layer theory" can be interpreted operationally as computation over structured
layers of local-to-global data. The closest stable mathematical abstraction is
sheaf-like computation over a base space, site, cover, or stratified object.

Tokitai should model this with:

- `BaseSpace` or `Site`.
- `Open` or generalized object of the site.
- `Cover`.
- `Section`.
- `RestrictionMap`.
- `GluingCondition`.
- `Sheaf`.

Minimal operations:

- Restrict a section to a smaller open.
- Check compatibility of local sections on overlaps.
- Glue compatible sections.
- Apply local operators cover-wise.
- Pull back or push forward along a morphism of bases.

The key execution idea is locality. If an operator is declared local, the planner
may compute it independently on cover elements and then glue or validate
compatibility. This is the sheaf-theoretic analogue of tensor fusion and tiling:
the mathematical structure exposes legal decomposition.

Planner-level sheaf strategies:

- Replace a global local operator with cover-wise local operators plus gluing.
- Choose a smaller basis or refinement when it reduces overlap checks.
- Push restrictions through local operators to avoid computing unused global
  data.
- Cache overlap computations as reusable obligations.
- Use Cech-style complexes when global consistency is the computation, not just
  a validation step.

Layer theory should therefore be treated as a first-class decomposition model:
each layer has local state, transition or restriction maps, compatibility rules,
and optional global reconstruction. This model can represent sheaves directly,
but it can also describe neural network layers, numerical multigrid levels,
precision refinements, and hardware memory hierarchy.

Potential future objects:

- Presheaves.
- Sheaves of modules.
- Cech complexes.
- Derived or cohomological approximations.
- Stratified spaces.
- Fiber bundles.
- Layered computational graphs where each layer has local rules and transition
  maps.

## Initial Milestones

### Milestone 0: Contracted Design Skeleton

- Create module boundaries.
- Define core metadata structs.
- Define conservative traits for domains, objects, contracts, operators, and
  backends.
- Add evidence and obligation data structures.
- Add this architecture document.

### Milestone 1: Minimal Semantic Runtime

- Implement dense tensor metadata as the first object representation.
- Implement finite-field or integer-like exact arithmetic plus `f32` as a
  contrasting approximate domain.
- Implement `Add`, `Mul`, `Map`, and `Reduce`.
- Add shape inference, domain checks, and contract propagation.

### Milestone 2: Evidence-Carrying Rewrites

- Add law and contract declarations.
- Add rewrite justifications and obligation tracking.
- Distinguish exact, floating, and approximate equality.
- Add rewrite guards based on lawfulness.
- Add property-based testing hooks generated from contracts.

### Milestone 3: LayerIR Prototype

- Represent tensor tiling, precision layers, and cover layers using one IR
  vocabulary.
- Add local/global operator classification.
- Add simple local-to-global planning.
- Emit auditable decomposition plans.

### Milestone 4: p-adic Prototype

- Implement a small fixed-precision p-adic domain.
- Support valuation, addition, multiplication, inversion, and truncation.
- Add p-adic property tests.
- Add simple p-adic vector and matrix operations.
- Add valuation-aware planning for at least one reduction-like operation.

### Milestone 5: Sheaf Prototype

- Implement a finite site or finite topological basis.
- Implement sections, restrictions, covers, and compatibility checks.
- Implement local operator application and gluing.
- Add Cech-style construction only after the basic model is stable.

### Milestone 6: Adaptive Planning

- Add execution plans.
- Add CPU backend selection.
- Add representation selection.
- Add shape and precision aware dispatch cache.
- Add cost model terms for precision loss, obligations, and locality.

### Milestone 7: Performance Backends

- Add SIMD and threading.
- Add kernel registry.
- Add benchmarking and tuning records.
- Explore GPU or MLIR integration.

## Design Constraints

- Keep mathematical semantics independent from storage representation.
- Make precision changes explicit.
- Avoid assuming floating point algebraic laws.
- Avoid assuming total ordering or Archimedean norms.
- Treat tensor computation as a special case of typed object computation.
- Prefer small, inspectable core abstractions over large generic hierarchies.
- Let optimization be justified by declared laws and checked invariants.
- Never perform a semantic rewrite without evidence.
- Treat precision, locality, and representation as planner resources.
- Make degraded backend semantics explicit.

## Recommended First Code Structure

```text
src/
  lib.rs
  domain/
    mod.rs
    structure.rs
    contract.rs
    valued.rs
    padic.rs
  object/
    mod.rs
    tensor.rs
    sheaf.rs
    meta.rs
  op/
    mod.rs
    signature.rs
    contract.rs
    arithmetic.rs
  ir/
    mod.rs
    semantic.rs
    contract.rs
    layer.rs
    schedule.rs
  planner/
    mod.rs
    obligation.rs
    plan.rs
  backend/
    mod.rs
    cpu.rs
  verify/
    mod.rs
    equality.rs
    properties.rs
```

The crate can still expose a simple high-level API later. The internal structure
should remain domain-first so that uncommon mathematical objects do not become
second-class extensions.

## 13. Post-P320 Refactor Backlog (Code Review Findings)

The post-P320 code review produced 13 concrete improvement phases
(P321-P333) covering refactors, API ergonomics, and documentation. The
phases are tracked in `todo.json` under `improvement_phases`.

### Tier A — Critical Refactors (5 phases)

- **P321**: Split `src/verify/mod.rs` (7,685 lines, 125 `pub use` re-exports)
  into focused submodules (`proof_replay`, `audit`, `reports`).
- **P322**: Replace remaining `unwrap()` in `src/checkpoint.rs` TKP1 binary
  parser with statically-safe inline 4-byte array construction.
- **P323**: Add `Shape::first_dim()`, `Shape::last_dim()`, `Shape::dim(index)`
  accessors so tensor code does not reach into `shape.dims[i]` directly.
- **P324**: Add `Padic::precision_consistent_with` and unify the
  `digits_base_p` / `digit_sum` / `lowest_digit` / `first_nonzero_digit`
  helpers via a private `digit_at(index)` helper.
- **P325**: Add `Tensor::try_from_vec` (Result-returning constructor that
  validates data length matches the product of static shape dimensions).

### Tier B — API Polish (5 phases)

- **P326**: Replace 125 individual `pub use` re-exports in
  `src/verify/mod.rs` with curated `pub mod` re-export blocks.
- **P327**: Add `CostModelWeights::balanced()` (default), `strict()` (3x
  base, 5x penalties), and `permissive()` (half base, 2x credits) constructors.
- **P328**: Add `Dim::Static.value()` and `value_or(default)` accessors
  so callers stop pattern-matching on the variant. Used by P323's refactor
  of `Tensor::flatten_first_axis`.
- **P329**: Add `Shape::ensure_rank` and `SectionTable::keys` / `values` /
  `iter` iterators.
- **P330**: Add `Tensor::truncate(n)` (drain prefix, return Vec) and
  `Tensor::into_inner` (consuming destructor) for MoE padding cleanup
  and meta/data hand-off.

### Tier C — Documentation (3 phases)

- **P331**: Add `examples/end_to_end_workflow.rs` covering
  `Tensor → plan → execute → verify` end-to-end (currently runs and prints
  `out.data = [4, 10, 18]` for `lhs=[1,2,3] * rhs=[4,5,6]`).
- **P332**: Add 3 `///` doc tests for top-level `Tokitai` facade methods
  (`cpu_only`, `padic_domain`, `execute_i64`).
- **P333**: This section — update `ARCHITECTURE.md` to reflect the
  post-P320 state.

### Test Baseline

The post-P320 baseline is **425 tests passing** across 76 test files
(393 passing in the pre-P321 baseline + 32 new tests added in P322-P330).
The full baseline runs cleanly with `cargo test --offline` and `cargo fmt --check`
produces no diffs.

### New Helpers in the post-P320 Surface

The following helpers were added in P306-P320 and are exercised by the
425-test baseline:

- **Tensor** (`src/object/tensor.rs`): `first`, `last`,
  `flatten_first_axis`, `element_equals`, `sum`, `reversed`, `min`, `max`,
  `truncate`, `into_inner`, `try_from_vec`.
- **Sheaf** (`src/object/sheaf.rs`): `SectionTable::len`, `is_empty`,
  `contains`, `keys`, `values`, `iter`; `FiniteSite::contains_all_opens`,
  `open_count`, `has_open`.
- **Padic** (`src/domain/padic.rs`): `Padic::digits_base_p` (lowest-digit-first base-p expansion), `Padic::lowest_digit`,
  `digit_sum`, `first_nonzero_digit`, `precision_consistent_with`,
  `same_prime_as`.
- **Shape** (`src/object/shape.rs`): `first_dim`, `last_dim`, `dim`,
  `ensure_rank`, plus `Dim::Static.value()` and `value_or()`.
- **CostModelWeights** (`src/planner/cost_model.rs`): `balanced()` (now
  the default), `strict()`, `permissive()`.

## 14. Post-P334 Functional Expansion (P335-P354)

The P335-P354 phases extend the operator surface, public documentation,
and developer ergonomics without touching the claim-boundary code that
the paper route depends on (audit, support matrix, release gate, theorem
binding, claim status, semantic conformance, HIP audit). Specifically:

- **P335-P339 (operator surface)**: 8 new arithmetic ops
  (SubOp, DivOp, ScalarAddOp, ScalarMulOp, PowOp, SqrtOp, Exp2Op, Log2Op),
  9 shape ops (ReshapeOp, TransposeOp, PermuteOp, SliceOp, ConcatOp,
  BroadcastOp, FlattenOp, SqueezeOp, UnsqueezeOp), 6 NN/activation ops
  (ReluOp, SigmoidOp, TanhOp, GELUOp, SoftmaxOp, LayerNormOp), 5 index
  ops (GatherOp, ScatterOp, IndexSelectOp, IndexAddOp, NonzeroOp), and
  9 reduction ops (SumOp, MeanOp, MaxOp, MinOp, ArgMaxOp, ArgMinOp,
  ProdOp, AnyOp, AllOp). All follow the same `op struct + OpSignature +
  infer + contracts + CPU helper + facade builder + integration test`
  pattern. The CPU backend is the correctness baseline; no GPU
  acceleration claim is associated with these ops.
- **P340 (doctests)**: 82 new `///` examples on the public `Tokitai`,
  `RuntimeOptions`, and `FacadeGraphBuilder` facade methods, bringing the
  doctest count to 85+ and making the rustdoc API self-explanatory.
- **P341 (book)**: 4 progressive tutorial chapters in `examples/book/`
  covering hello-tensor, shapes-and-reshape, p-adic-valuation-skip, and
  finite-sheaf-glue.
- **P342 (broken test fixes)**: 3 test files that depended on
  rocm-hip-only paths were re-gated or removed; the standard
  `cargo test --offline` now compiles and runs every test file.
- **P343 (graph DSL)**: a new closure-based `crate::ir::dsl::GraphBuilder`
  with `TensorHandle` and `CompiledGraph` lets users build a graph in
  a single expression. `Tokitai::graph(closure)` exposes the DSL through
  the public facade.
- **P344 (verify refactor)**: deferred to keep `src/verify/mod.rs`
  claim-boundary; tracked in `todo.json` for a future pass.
- **P349 (unwrap audit)**: 4 remaining public-path `unwrap()`s replaced
  with `expect()` / `map_or` / direct-indexing accessors.
- **P350 (integration test)**: `tests/mlp_forward_e2e.rs` exercises a
  full `Linear -> LayerNorm -> GELU -> Linear -> Softmax` chain through
  both the imperative and the DSL facade.
- **P352 (proptest)**: 5 new proptest files covering the new ops;
  `proptest = "1.5"` is now a dev-dependency.
- **P353 (architecture diagrams)**: 4 Mermaid diagrams in
  `docs/dev/architecture_diagrams.md` (8-layer, planner pipeline,
  proof-replay trace, theory-evidence chain).
- **P354 (this section)**: re-syncs the docs to reflect the new surface.

### Diagrams

See `docs/dev/architecture_diagrams.md` for visual overviews of the
eight architectural layers, the planner pipeline, the proof-replay
trace flow, and the theory-evidence chain.

### Test Baseline After P354

The post-P354 baseline is **611+ tests passing** (up from 425 at the
end of P333). The four pre-existing `tests/paper_artifacts.rs` failures
(claiming a `current_focus` for a P-number that has been re-baselined)
remain in scope of `tests/paper_artifacts.rs` only and are not part of
the canonical baseline.

### Test Baseline After P440

The post-P440 baseline is **693 tests passing** (up from 611+ at the
end of P354). The P381-P420 polish and documentation batches added the
observability test counterparts (`tests/claim_status_report.rs`,
`tests/submission_readiness.rs`) and the P426-P440 code-review improvement
batch added the variant-asserted `tests/error_variants.rs`, the
arithmetic `MatmulOp` / `FmaOp` / `MapOp` tests, the proptest `GeluOp` /
`SoftmaxOp` / `LayerNormOp` cases, and the `tests/README.md` per-item
coverage check. The full baseline runs cleanly with `cargo test --offline`
and `cargo fmt --check`.