I ran a minimal validation experiment on a question that’s been bothering me for a while:
Is enforcing global constraints on attention (e.g. doubly-stochastic / Birkhoff projections) sufficient to reduce semantic drift — or does geometry matter?
I compared three 2-layer transformer variants on a synthetic task with controlled semantic drift:
• Standard attention
• mHC-style doubly-stochastic mixing
• Toroidal (topology-constrained) attention
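The real implementations are in the linked repo; as a minimal sketch of the two constrained variants (function names, the Sinkhorn iteration count, and the band-mask form are my assumptions, not taken from the experiment):

```python
import numpy as np

def sinkhorn_doubly_stochastic(logits, n_iters=50):
    """Approximate a Birkhoff-polytope projection of attention logits
    by Sinkhorn normalization: alternately normalize rows and columns
    in log space until the matrix is (nearly) doubly stochastic."""
    log_p = logits.astype(float).copy()
    for _ in range(n_iters):
        log_p -= np.log(np.exp(log_p).sum(axis=1, keepdims=True))  # rows -> 1
        log_p -= np.log(np.exp(log_p).sum(axis=0, keepdims=True))  # cols -> 1
    return np.exp(log_p)

def toroidal_mask(n, radius):
    """Boolean attention mask on a ring of length n: position i may
    attend only to positions within `radius` steps, with wrap-around
    (the 1-D torus case of a topology-constrained attention pattern)."""
    idx = np.arange(n)
    dist = np.abs(idx[:, None] - idx[None, :])
    dist = np.minimum(dist, n - dist)  # circular (toroidal) distance
    return dist <= radius
```

Masking logits with `toroidal_mask` before softmax gives the local variant; running raw logits through `sinkhorn_doubly_stochastic` gives the globally constrained one.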
Key observations (single run, small model):
• Toroidal attention reduced drift by ~40% relative to the standard-attention baseline
• Gradients were more stable under the local topological constraint
• Doubly-stochastic mixing alone produced a blow-up in coherence variance
This isn’t a claim of better models, only that constraint ≠ structure and that locality matters.
Code + experiment setup here: [GitHub link]
Curious if others have seen similar failure modes with global mixing constraints.