noos 0.4.1

Reliability layer for Rust LLM agents: scope drift, cost circuit breaks, and procedural correction memory as event-driven Decisions.
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
# Noos

[![CI](https://github.com/Triangle-Technology/noos/actions/workflows/ci.yml/badge.svg)](https://github.com/Triangle-Technology/noos/actions/workflows/ci.yml)
[![crates.io](https://img.shields.io/crates/v/noos.svg)](https://crates.io/crates/noos)
[![docs.rs](https://docs.rs/noos/badge.svg)](https://docs.rs/noos)
[![license](https://img.shields.io/crates/l/noos.svg)](LICENSE)

> Reliability layer for Rust LLM agents. Detects tool-call retry loops,
> learns from user corrections across sessions, halts on cost × quality
> compound, flags scope drift — all as event-driven decisions, no
> framework lock-in.

`Regulator` sits between your agent's retry loop and your LLM. Every turn
your loop runs, you emit a handful of events — user message, LLM response,
tokens spent, tool calls, optional quality signal. `Regulator` returns a
`Decision`: `Continue`, `CircuitBreak`, `ProceduralWarning`,
`ScopeDriftWarn`, or `LowConfidenceSpans`. Your loop branches on the
variant and keeps moving.

Nothing in Noos wraps your LLM client. There is no framework lock-in and no
runtime dependency on a specific model. The event surface is a single enum
your code owns.

## Problem

Two failure modes cost real money in production:

- **Tool-call retry loops.** A November 2025 LangChain deployment looped
  between two agents for 11 days and burned through $47,000 before anyone
  noticed. `max_iterations` and transport-level retry crates don't catch
  this — the tool call *succeeds* at the protocol level every time; it's
  the agent's decision to re-invoke the same tool that needs to halt.
- **Wasted retries on non-retryable errors.** In a 200-task benchmark,
  90.8% of retries were wasted because systems re-ran hallucinated tool
  names and doomed re-prompts. A compound halt predicate (cost AND
  declining quality) catches this where a cost-only budget doesn't.

Two more are slower-burning but corrode trust over time:

- **Procedural correction amnesia.** Content-memory stores (Mem0, Letta,
  Zep) remember what the user said, but not that the user *keeps pushing
  back the same way*. The third time a user says "no telemetry",
  something structural should change before the fourth generation.
- **Scope drift.** The LLM answered a *different* question than the one
  asked — typically reaches the user and only gets caught if the user
  notices.

Noos surfaces all four signals *during* the loop so the app can act
before delivery: halt a runaway tool loop, halt a cost-burning
quality-dropping retry loop, consult learned corrections before the next
generation, strip drifted material.

## Quick start

```toml
[dependencies]
noos = "0.3"
```

```rust
use noos::{Decision, LLMEvent, Regulator};

let mut regulator = Regulator::for_user("alice").with_cost_cap(2_000);

regulator.on_event(LLMEvent::TurnStart {
    user_message: "Refactor fetch_user to be async".into(),
});

// ... call your LLM of choice ...

regulator.on_event(LLMEvent::TurnComplete {
    full_response: response_text.clone(),
});
regulator.on_event(LLMEvent::Cost {
    tokens_in,
    tokens_out,
    wallclock_ms,
    provider: Some("anthropic".into()),
});

match regulator.decide() {
    Decision::Continue => { /* send response to user */ }
    Decision::CircuitBreak { reason, suggestion } => {
        // Halt the retry loop; surface `suggestion` to the user. The
        // `reason` variant distinguishes CostCapReached /
        // QualityDeclineNoRecovery / RepeatedToolCallLoop.
    }
    Decision::ProceduralWarning { patterns } => {
        // Read `pattern.example_corrections` into the next prompt so
        // the LLM sees the user's prior pushback before generating.
        // Or use the 0.2.2 helper:
        //   let prompt = regulator.inject_corrections(&user_message);
    }
    Decision::ScopeDriftWarn { drift_tokens, drift_score, .. } => {
        // drift_score >= 0.50 means the response added keywords with no
        // anchor in the task. Strip, re-prompt, or annotate.
    }
    Decision::LowConfidenceSpans { .. } => { /* reserved for future */ }
    // `Decision` is `#[non_exhaustive]` since 0.2.1 — future variants
    // (e.g. a multi-concern aggregator) require a wildcard arm.
    _ => {}
}
```

Full per-event contract + timing rules (pre- vs post-generation `decide()`
calls, per-query vs per-task regulator lifetime): see
[`docs/regulator-guide.md`](docs/regulator-guide.md).

**Migrating from another tool?** See
[`docs/migrating.md`](docs/migrating.md) for concrete recipes replacing
LangChain `recursion_limit`, Mem0/Letta content memory, tenacity-style
retry wrappers, and Langfuse/Arize observability.

## What this closes that competitors don't

Leading with the two claims that are net-new across the Python and Rust
agent ecosystems as of 2026-Q2 — ordering follows strongest wedge first:

|                               | Tool-loop halt (structural) | Pattern extraction from corrections | Cost × quality halt | Scope drift (pre-delivery) | Remember content | Log turns |
|-------------------------------|:---------------------------:|:-----------------------------------:|:-------------------:|:--------------------------:|:----------------:|:---------:|
| Langfuse / Arize / Helicone   |||||||
| Mem0 / Letta / LangMem / Zep  ||           partial¹           |||||
| Portkey / litellm / OpenRouter|||   transport only    ||||
| Tenacity / backoff            |||||||
| LangGraph / CrewAI / AutoGen  |   max_iterations only²   ||   max_iterations    ||||
| **Noos**                      |           ****             |                ****                |       ****         |           ****            |   ✓ structural   ||

¹ LangMem (Python) generates procedural patterns by asking an LLM to
rewrite the system prompt. Noos does it structurally — per-cluster
correction counting, no LLM-in-the-loop, deterministic, sub-millisecond.

² `recursion_limit` / `maxIterations` count total steps regardless of
what's repeating. Noos's `RepeatedToolCallLoop` fires on the
*signature* of a retry pathology: 5+ consecutive invocations of the
same tool without interleaving. That's the shape of the $47k LangChain
incident referenced above.

Noos is not a logging layer — pair it with Langfuse / Arize / Helicone
if you want observability on top of the regulatory decisions. The value
is the real-time decision surface during the loop, not the post-hoc
record.

## Demos

Five flagship demos, each closing one loop competitors cannot. All run
canned by default (no LLM required); demos 2–4 have `ollama` and
`anthropic` live modes; demo 5 is timing-based and canned-only.

- [`regulator_tool_loop_demo`]examples/regulator_tool_loop_demo.rs *(0.3.0, leads)*
  — agent calls `search_orders` 5× in a row with tweaked args, each
  returning an empty result. Protocol-level successful calls — retry /
  backoff crates don't flag them; `max_iterations` bounds *total*
  iteration, not consecutive-same-tool. `RepeatedToolCallLoop` fires
  structurally.
- [`regulator_correction_memory_demo`]examples/regulator_correction_memory_demo.rs
  — 3 explicit corrections build a `CorrectionPattern`; state exports
  → JSON round-trips → imports; next-session turn on the same cluster
  fires `ProceduralWarning` pre-generation with stored example
  corrections attached. Uses `inject_corrections` helper since 0.2.2.
- [`regulator_implicit_correction_demo`]examples/regulator_implicit_correction_demo.rs *(0.3.1-dev, new)*
  — same `ProceduralWarning` outcome as above, but WITHOUT the app
  having to emit explicit `UserCorrection` events. Three retries
  within the configured window on the same topic cluster → correction
  pattern emerges automatically. Directly addresses the adoption gap
  that "chat UIs rarely surface explicit correction signals".
- [`regulator_cost_break_demo`]examples/regulator_cost_break_demo.rs
  — 3 retry turns with declining quality trip
  `CircuitBreak(CostCapReached)` on turn 3. Output demonstrates priority
  ordering (drift warnings on turns 1–2, circuit break dominates on 3).
- [`regulator_scope_drift_demo`]examples/regulator_scope_drift_demo.rs
  — a refactor response adds logging / error handling / telemetry nobody
  asked for. Output shows the exact `drift_tokens` list pre-delivery.

```bash
cargo run --example regulator_tool_loop_demo
cargo run --example regulator_correction_memory_demo
cargo run --example regulator_implicit_correction_demo
cargo run --example regulator_cost_break_demo
cargo run --example regulator_scope_drift_demo
```

## Eval numbers

Tier 2.2 benchmark: mixed-cluster workload (FactQA / Debug / Refactor / Ambiguous),
regulator-enabled arm vs naive-retry baseline. **Honest reporting** — we
distinguish canned (synthetic-oracle) reproducibility numbers from
real-LLM + real-judge data points.

### Real-LLM + real-judge results

The only fair-comparison metric is: real model generating responses,
real grader scoring them, both arms running against the same model
state. Two datasets, both with `NOOS_JUDGE=anthropic` replacing the
synthetic oracle with Claude Haiku:

**Ollama phi3:mini + Haiku judge, interleaved arms (N=29 fair pairs)**

|                        | baseline | regulator |   Δ  |
|------------------------|---------:|----------:|-----:|
| total cost (tokens_out)|   25 494 |    24 118 | **−5.4 %** |
| mean quality           |    0.510 |     0.484 | −0.026 (noise) |
| queries circuit-broken |        0 |         3 ||
| attempts               |       48 |        37 | **−23 %** |

**Anthropic claude-haiku-4-5 generator + Haiku judge (N=50)**

|                        | baseline | regulator |   Δ  |
|------------------------|---------:|----------:|-----:|
| total cost (tokens_out)|    9 100 |    10 157 | +11.6 % (regulator more expensive) |
| mean quality           |    0.570 |     0.618 | +0.048 (noise) |
| queries circuit-broken |        0 |         0 ||
| attempts               |       58 |        64 | +10 % |

**Reading**: Quality deltas (|Δ| ≤ 0.05) sit inside the noise bar for
these sample sizes (SE(Δmean) ≈ 0.04–0.05). **No detectable quality
regression from adding the regulator** on either model. Cost direction
is model-dependent — phi3 over-generates on CPU and the regulator's
`CircuitBreak` cuts real excess; Haiku is already concise so the
regulator's overhead is slightly net-negative.

```bash
# Real Ollama + Haiku judge (requires local Ollama + ANTHROPIC_API_KEY)
NOOS_JUDGE=anthropic \
  cargo run --release --example task_eval_real_llm_regulator -- ollama

# Real Anthropic + Haiku judge (requires ANTHROPIC_API_KEY only)
NOOS_JUDGE=anthropic \
  cargo run --release --example task_eval_real_llm_regulator -- anthropic
```

### Canned (synthetic oracle, reproducibility guard only)

Used by CI to catch regressions in the harness itself. **These numbers
do NOT transfer to real LLMs** — the oracle's per-retry quality decline
triggers `QualityDeclineNoRecovery` in a way real graders don't:

|                        | baseline | regulator |   Δ  |
|------------------------|---------:|----------:|-----:|
| total cost (tokens_out)|   16 040 |    11 360 | −29.2 % |
| mean quality           |    0.598 |     0.616 | +0.018 |
| queries circuit-broken |        0 |         9 ||

```bash
cargo run --release --example task_eval_real_llm_regulator
```

The +46 % quality-per-1k claim that historically appeared in this table
is a scoring artifact of `Cluster::canned_quality(retry)`, not a signal
on real workloads. It's preserved as a bit-for-bit CI guard so changes
to the harness surface explicitly.

### What the regulator provides on real workloads

- **Quality parity** — no detectable regression from adding the wrapper
- **Scope-drift warnings** on majority of turns (real LLMs drift more
  than canned responses)
- **Circuit-break halts** on Ambiguous-cluster retry loops when the
  model exhibits `QualityDeclineNoRecovery` — real on phi3 over 5+
  retries, rarer on Haiku
- **Procedural-memory surfacing** once `MIN_CORRECTIONS_FOR_PATTERN=3`
  threshold trips (see `regulator_correction_memory_demo`)
- **Cost-cap circuit-breaks** (not triggered in this eval — cap sized
  above typical query cost)

The regulator is infrastructure, not a quality booster. Eval confirms
it doesn't cost you measurable quality; the other value props are
qualitative surfacings that the baseline simply lacks.

## Observability

One-call metrics snapshot for Prometheus / Datadog / StatsD pipelines
— no need to call eight individual accessors:

```rust
let snap = regulator.metrics_snapshot();
for (key, value) in snap {
    metrics_client.gauge(&key, value);
}
```

All keys are `noos.` prefixed and stable across releases. Covers
confidence, logprob coverage, cumulative token spend, cost cap, tool
call / duration / failure counts, and the implicit-correction counter.
`decide()` is not sampled — it's an explicit call point — so
`metrics_snapshot()` is cheap enough to call every turn.

## Performance

Criterion benchmarks on the hot path (release build, Windows /
Ryzen-class CPU — re-run locally with `cargo bench` for your hardware):

| Operation                                | Median | Throughput |
|------------------------------------------|-------:|-----------:|
| `on_event(Cost)`                         |  20 ns |   49 M/s   |
| `on_event(Token)`                        |  44 ns |   23 M/s   |
| `on_event(ToolCall)`                     | 249 ns |   4.0 M/s  |
| `decide()``Continue`                  | 2.0 µs | 498 K/s    |
| `decide()``ScopeDriftWarn` (full)     | 3.7 µs | 268 K/s    |
| `on_event(TurnComplete)`                 | 3.9 µs | 255 K/s    |
| `on_event(TurnStart)`                    |  22 µs |  46 K/s    |
| `export``from_json` roundtrip         | 1.2 µs | 836 K/s    |
| **Realistic turn** (1 × start + 100 × token + complete + cost + decide) | **30 µs** | 33 K turns/s |

Interpretation: a realistic turn at 100 streamed tokens adds **~30 µs of
regulator overhead**, dominated by `TurnStart` (keyword extraction /
cluster computation). An LLM call at 200 tokens/sec runs for ~500 ms
per turn — Noos overhead is **six orders of magnitude smaller** than
the LLM latency it wraps. `decide()` on the Continue path is 2 µs, so
calling it multiple times per turn (pre-generation probe + post-delivery
probe + per-cost-event probe) remains cheap.

```bash
cargo bench --bench regulator
```

Benchmarks are not in the published crate — they live in `benches/`
for internal measurement only.

## OpenTelemetry GenAI ingestion

If your agent is already instrumented with the [OTel GenAI semantic
conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/), feed
your spans to Noos without rewiring your event bus:

```rust
use noos::{regulator::otel, Regulator};

let span: serde_json::Value = get_otel_span_json();
let mut r = Regulator::for_user("alice");
for event in otel::events_from_span(&span) {
    r.on_event(event);
}
match r.decide() { /* ... */ }
```

Maps `gen_ai.user.message` → `TurnStart`, `gen_ai.assistant.message` →
`TurnComplete`, `gen_ai.usage.*` + span duration → `Cost`, and
`gen_ai.tool.message` → `ToolCall` / `ToolResult`. See
[`src/regulator/otel.rs`](src/regulator/otel.rs) for the full attribute
mapping table.

## Bindings

Cross-language bindings mirror the Rust API 1:1. The event / decision
vocabulary is identical across all three; the `.kind` string pattern
replaces Rust's exhaustive `match` where the host language lacks
data-variant enums.

### Python — `bindings/python/`

```bash
pip install noos   # once published; PyO3 abi3-py39 wheels
```

```python
from noos import Regulator, LLMEvent

r = Regulator.for_user("alice")
r.with_cost_cap(2000)
r.on_event(LLMEvent.turn_start("Refactor fetch_user"))
r.on_event(LLMEvent.turn_complete(response))
r.on_event(LLMEvent.cost(tokens_in=25, tokens_out=800, wallclock_ms=500))

d = r.decide()
if d.kind == "circuit_break":
    print(d.suggestion)
```

See [`bindings/python/README.md`](bindings/python/README.md).

### Node.js / TypeScript — `bindings/node/`

```bash
npm install @triangle-technology/noos   # once published; napi-rs native addon
```

```typescript
import { Regulator, LLMEvent } from '@triangle-technology/noos'

const r = Regulator.forUser('alice')
r.withCostCap(2_000)
r.onEvent(LLMEvent.turnStart('Refactor fetch_user'))
r.onEvent(LLMEvent.turnComplete(response))
r.onEvent(LLMEvent.cost(25, 800, 500, 'anthropic'))

const d = r.decide()
if (d.kind === 'circuit_break') console.log(d.suggestion)
```

TypeScript `.d.ts` auto-generated from the Rust source. See
[`bindings/node/README.md`](bindings/node/README.md).

## Status (2026-04-18, crate `noos 0.4.0` on crates.io)

Path 2 MVP + three feature expansions shipped:

- **0.2.0** rebranded crate name and Rust types from `Nous*` to `Noos*`
  (crates.io name was taken).
- **0.2.1** added `#[non_exhaustive]` to `LLMEvent` / `Decision` /
  `CircuitBreakReason` so future variants don't break downstream
  exhaustive matches; `#[must_use]` on `Decision`; fixed a silent
  `LearnedState` export bug where LC `tick` + `gain_mode` were never
  flushed to the exported snapshot.
- **0.2.2** Path B — `Regulator::corrections_prelude()` /
  `inject_corrections(&str) -> String` helpers replace the 15-line
  hand-threading recipe for `ProceduralWarning`.
- **0.3.0** Path A — tool-call observation channel. New events
  `LLMEvent::ToolCall` + `LLMEvent::ToolResult`, new
  `CircuitBreakReason::RepeatedToolCallLoop`, new per-turn tool-stats
  accessors, and a fourth flagship demo. Catches the retry-loop failure
  mode that transport-retry crates and `max_iterations` don't.
- **0.4.0** — OTel GenAI ingestion adapter
  (`regulator::otel::events_from_span`), implicit-correction detector
  (`Regulator::with_implicit_correction_window`) that fires on
  temporal-proximity + topic-continuity retries without requiring
  explicit `UserCorrection` events, `metrics_snapshot()` one-call
  observability dump for Prometheus / Datadog / StatsD, migration
  guide (`docs/migrating.md`) with 4 recipes replacing LangChain /
  Mem0 / tenacity / Langfuse wiring, and an honest eval posture
  replacing the synthetic-oracle headline numbers with real-LLM +
  real-judge data (quality parity, model-dependent cost direction).
  Internal: interleaved-arm eval methodology + crash-safe checkpoint,
  criterion benchmarks (30µs/turn overhead), TLS fix in examples,
  11 adversarial tests pinning scope/tool-loop limitations.

Tests, clippy, and rustdoc all clean (run `cargo test && cargo clippy --lib --tests && cargo doc --lib --no-deps` to verify current counts).

### Empirically validated

- **Regulator (primary API)** — Cross-session strategy learning via
  `LearnedState` (Tier 1.1 synthetic, 3 seeds, 2σ bar). Survives
  `RegulatorState` export/import roundtrips so learned per-cluster
  strategies persist across process restarts.
- **Advanced (local Mamba/SSM inference only)** — Compensatory
  state-retention modulation via `CognitiveSession` (perplexity −1.86 %
  on emotional text, 3 runs bit-identical). Not exercised by `Regulator`;
  see [Advanced: direct cognitive-session access]#advanced-direct-cognitive-session-access.

### Shipped infrastructure with honest eval posture

- `Regulator` + four flagship demos + 50-query eval harness.
  Real-LLM + real-judge evals (see [Eval numbers]#eval-numbers)
  show quality parity with baseline (|Δ| ≤ 0.05 at N=29 and N=50, within
  noise). Canned oracle numbers are preserved as CI regression guards
  only, not as product claims.

### Measured limitations

- Metacognition abstention: infrastructure-only on simple tasks (matches
  smart baseline).
- Fatigue detection: regime-dependent (slower than rolling average on
  abrupt drops).

Full honest-scope table + full session-by-session project status in
[`CLAUDE.md`](https://github.com/Triangle-Technology/noos/blob/main/CLAUDE.md) on the
source repo (not published to crates.io — it cross-references internal
memory files that stay private).

## Advanced: direct cognitive-session access

Underneath `Regulator` runs `CognitiveSession`, a pipeline producing
continuous signals (`conservation`, `confidence`, strategy
recommendation, gain mode) plus delta-modulation output for local
Mamba/SSM inference.

Most integrations should prefer `Regulator` — the event surface is
smaller, the Decision enum is easier to branch on, and cross-session
learning survives through `RegulatorState` without touching the cognitive
layer directly. Use `CognitiveSession` directly only if you need one of:

- **Raw continuous signals** for a custom decision policy (e.g. you want
  `signals.conservation` to influence retrieval budget instead of relying
  on `CircuitBreak` to halt).
- **Delta-modulation hints** for local Mamba inference via the `candle`
  feature flag — the validated perplexity −1.86 % path.

Semantic contract for signals + closed-loop requirements:
[`docs/app-contract.md`](docs/app-contract.md). Same crate, one layer
deeper; `Regulator` and `CognitiveSession` can coexist in the same
process.

## Docs

- [`docs/regulator-guide.md`]docs/regulator-guide.md — app integrator's
  guide: event ordering, decision handling, gotchas.
- [`docs/app-contract.md`]docs/app-contract.md — semantic contract
  between Noos and your app.
- [`docs/regulator-design.md`]docs/regulator-design.md — Session 15
  Path 2 design doc.
- [`principles.md`]principles.md — 10 design principles any change must
  respect.

## License

MIT. See [`LICENSE`](LICENSE).