I ran a minimal validation experiment on a question that’s been bothering me for a while:
Is enforcing global constraints on attention (e.g. doubly-stochastic / Birkhoff projections) sufficient to reduce semantic drift — or does geometry matter?
I compared three 2-layer transformer variants on a synthetic task with controlled semantic drift:
• Standard attention
• mHC-style doubly-stochastic mixing
• Toroidal (topology-constrained) attention
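The real implementations are in the linked repo; as a minimal sketch of the two constrained variants (function names, the Sinkhorn iteration count, and the band-mask form are my assumptions, not taken from the experiment):

```python
import numpy as np

def sinkhorn_doubly_stochastic(logits, n_iters=50):
    """Approximate a Birkhoff-polytope projection of attention logits
    by Sinkhorn normalization: alternately normalize rows and columns
    in log space until the matrix is (nearly) doubly stochastic."""
    log_p = logits.astype(float).copy()
    for _ in range(n_iters):
        log_p -= np.log(np.exp(log_p).sum(axis=1, keepdims=True))  # rows -> 1
        log_p -= np.log(np.exp(log_p).sum(axis=0, keepdims=True))  # cols -> 1
    return np.exp(log_p)

def toroidal_mask(n, radius):
    """Boolean attention mask on a ring of length n: position i may
    attend only to positions within `radius` steps, with wrap-around
    (the 1-D torus case of a topology-constrained attention pattern)."""
    idx = np.arange(n)
    dist = np.abs(idx[:, None] - idx[None, :])
    dist = np.minimum(dist, n - dist)  # circular (toroidal) distance
    return dist <= radius
```

Masking logits with `toroidal_mask` before softmax gives the local variant; running raw logits through `sinkhorn_doubly_stochastic` gives the globally constrained one.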
Key observations (single run, small model):
• Toroidal attention reduced drift by ~40% relative to the standard-attention baseline
• Gradients were more stable under the local topological constraint
• Doubly-stochastic mixing alone produced a blow-up in coherence variance
This isn’t a claim of better models, only that constraint ≠ structure and that locality matters.
Code + experiment setup here: [GitHub link]
Curious if others have seen similar failure modes with global mixing constraints.