Topological Constraints for Coherent Language Models

Training-Dynamics Effects and a Null Result for Inference-Time Hallucination Reduction

Sylvain Cormier | Paraxiom Research | January 2026

Abstract

Residual geometry determines whether reasoning is stable. We show that transformer latent dynamics, operating on unconstrained vector spaces, lack the conserved quantities necessary for bounded inference. This establishes a hierarchy of sufficient conditions:

mHC (Birkhoff) ⊂ ERLHS (Hamiltonian) ⊂ Karmonic (Toroidal + Spectral)

The original hypothesis was that reduced drift would thereby reduce hallucination. A controlled replication did not support this out-of-sample: the inference-time bias yields no statistically significant hallucination reduction on held-out benchmarks (McNemar p=0.22). The geometric / training-dynamics effects are in-sample only.

Key Theoretical Contributions

1. Hallucination as Geometry Problem

We argue that hallucination is not a training data problem, an alignment failure, or an inherent limitation of autoregressive generation. Hallucination is a geometry problem: unconstrained latent dynamics permit arbitrary drift through latent space.

2. Hierarchy of Constraints

Level	Adds	Solves
mHC (Birkhoff polytope)	Bounded mixing	Training stability
ERLHS (Hamiltonian)	Conserved flow	Inference coherence
Karmonic (Toroidal + Spectral)	Spectral gap	Noise suppression

3. Spectral Alignment (Resonance)

Modes that align with the manifold's eigenstructure persist under repeated composition. Non-resonant modes decay as e^(-λt).

Epistemic boundary: Spectral alignment filters, stabilizes, and selects. It does not alone guarantee semantic correctness. A resonant mode may be stably wrong.

Empirical Results

Replication Update (March 2026)

A comprehensive independent replication (6 phases, 4 models, 3 benchmarks) found that the inference-time toroidal logit bias does not produce statistically significant hallucination reduction. The original v2 results were within LLM judge sampling variance. Full replication data in experiments/results/.

T&I Exact Replication (Qwen 7B, n=200, exact v2 methodology):

Metric	Baseline	Toroidal	Delta	p-value
T&I %	76.5%	74.5%	−2.0pp	0.22

Baseline matches v2 (76.5% vs 75.6%), confirming correct methodology. Toroidal shows opposite direction, not significant.

Alpha Sweep (Qwen 7B, n=100): Higher alpha monotonically degrades output. α=0.3 has zero effect; α≥5.0 causes catastrophic degradation (75–96% hallucination).

Active Ingredient: The hardening system prompt ("Answer concisely and truthfully") produces −14pp hallucination reduction (p=0.05) — this prompt engineering, not the toroidal bias, was the effective component in the Coherence Shield pipeline.

Original v2 Results (NOT REPLICATED)

Model	Baseline T&I	Toroidal T&I	Delta
Qwen 0.5B	16.9%	17.1%	+0.2pp
Qwen 1.5B	32.2%	32.8%	+0.6pp
Qwen 7B	75.6%	77.7%	+2.1pp
Mistral 7B	74.4%	77.2%	+2.8pp

Toy Model Validation (Still Valid)

Training-time toroidal attention masks on a 2-layer transformer:

Condition	Drift Rate	Interpretation
Baseline	0.0100	Control
Toroidal	0.0060	40% lower drift
Random sparse	0.1673	28x worse — proves topology matters, not sparsity

Critical Insight: Negative Control

Random graph masking (same sparsity, no topological structure) has drift rate 0.167 vs toroidal's 0.006. This proves it's specifically topological structure that matters — sparsity alone is insufficient. However, this training-time result does not transfer to inference-time logit biasing.

Repository Structure

topological-coherence/
├── src/
│   ├── topological_coherence/          # Python package (PyPI)
│   │   ├── logit_bias.py              # ToroidalLogitProcessor
│   │   ├── tonnetz.py                 # Tonnetz topology
│   │   ├── masks.py                   # Toroidal mask generation
│   │   ├── attention.py               # Attention layer variants
│   │   ├── drift.py                   # Drift measurement
│   │   └── tests/                     # Unit tests
│   └── lib.rs                         # Rust crate (crates.io)
├── paper/
│   ├── toroidal_hallucination_reduction_2026.tex  # v2 paper (multi-model)
│   └── toroidal_hallucination_reduction_2026.pdf
├── cormier_topological_coherence_2026.tex   # Theory paper (LaTeX)
├── cormier_topological_coherence_2026.pdf   # Theory paper (PDF)
├── results/                            # v2 benchmark data & charts
├── experiments/                        # Validation scripts
├── diagrams/                           # Result visualizations
├── docs/                               # Unified theory & diagrams
├── huggingface-space/                  # HuggingFace Space demo
├── presentation/                       # HTML presentation
├── Cargo.toml                          # Rust crate config
├── pyproject.toml                      # Python package config
└── LICENSE                             # Apache 2.0

Running the Experiment

Prerequisites

Python 3.8+
~500MB disk space for PyTorch

Installation

cd experiments
python3 -m venv venv
source venv/bin/activate
pip install torch numpy

Run

python tonnetz_validation.py

Expected runtime: ~4 minutes on CPU (no GPU required)

Expected Output

The experiment trains 4 models (baseline, mHC, toroidal, random) and reports:

Drift rate (lower = better semantic coherence)
Coherence variance (hidden state stability)
Gradient norm (training stability)

Theoretical Background

Tonnetz Topology

The Tonnetz is a 2D torus where:

Horizontal edges connect by perfect fifths
Vertical edges connect by major thirds
Diagonal edges connect by minor thirds

We use it as a constructive existence proof of a low-genus manifold with constant spectral gap—not as a claim about semantic universals.

Spectral Gap

For a d-dimensional torus T^d_N:

λ₁ = 2 - 2cos(2π/N) = Θ(1)

for fixed side length N, independent of total nodes N^d.

Important caveat: This holds for fixed torus side length N. Scaling N reintroduces gap decay as O(1/N²).

Why Not Implicit Smoothing?

Standard transformer components (LayerNorm, softmax temperature, multi-head averaging) provide some implicit spectral filtering. However, none impose topological constraints—they operate pointwise or via soft weighting, not via manifold structure. They smooth without providing a conserved quantity or spectral gap guarantee.

The distinction is between ad-hoc regularization (which helps) and geometric constraint (which bounds).

Citation

@misc{cormier2026topological,
  author = {Cormier, Sylvain},
  title = {Topological Constraints for Coherent Language Models: Training-Dynamics Effects and a Null Result for Inference-Time Hallucination Reduction},
  year = {2026},
  publisher = {Zenodo},
  url = {https://github.com/Paraxiom/topological-coherence}
}

Related Work

Paper	Topic	Link
Unified Theory	Conservative composition across ML, blockchain, consensus	docs/UNIFIED_THEORY.md
ERLHS	Hamiltonian framework for coherence-preserving ML	DOI: 10.5281/zenodo.17928909
Karmonic Mesh	Spectral consensus on toroidal manifolds	DOI: 10.5281/zenodo.17928991
mHC	Manifold-Constrained Hyper-Connections	arXiv:2512.24880
Graph Signal Processing	Spectral methods on graphs	Shuman et al., 2013

Key Equations

Toroidal Attention Mask (Eq. 17)

M_Tonnetz(i, j) = 1                           if d_Tonnetz(i, j) ≤ r
                  exp(-α · d_Tonnetz(i,j))    otherwise

Learned Toroidal Projection (Eq. 20)

φ_θ(e) = ( σ(W₁e) mod 1, σ(W₂e) mod 1 )

Adjacency Loss (Eq. 21)

L_topo = E[(a,b)~co-occur][d_T(φ(a), φ(b))] - λ · E[(a,c)~random][d_T(φ(a), φ(c))]

Limitations

Inference-time logit bias does not replicate: The v2 TruthfulQA improvements were within LLM judge sampling variance
Hyperparameter sensitivity: The OLMo +15.4% result came from a 100-configuration sweep (overfitting to test set)
Judge bias: LLM-judged evaluation uses Qwen-7B as both subject and judge, introducing variance
Bias magnitude: At practical α (0.1–0.3), the logit shift (~0.9 max) is too small to change argmax under greedy decoding. Under sampling, effects are not statistically significant at n=200.

Future Work

Training-time Karmonic regularization: The toy model shows topology matters at training time — the untested promising direction
Compare with other geometric constraints (hyperbolic, spherical)
Orthogonal projection with RAG-expanded evidence basis (question-only basis too restrictive)
Learned toroidal mappings (semantic embeddings instead of modular arithmetic)

License

Apache 2.0

Contact

Author: Sylvain Cormier
Email: sylvain@paraxiom.org
Organization: Paraxiom Research

"Geometric constraints provide one principled path to coherent artificial intelligence—not the only path, but a formally grounded one."

topological-coherence 0.3.1