1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
//! Prometheus instrumentation for the cached-input-pricing layer.
//!
//! Thin `record_*` helpers over the `metrics` facade, mirroring [`crate::metrics::errors`].
//! Conventions:
//! - Every metric name is `dwctl_cache_*`.
//! - Low-cardinality labels are `&'static str` literals (`outcome`/`reason`/`tier`/`result`/`marked`).
//! The one dynamic label is `model` (owned `String`), attached **only** by `record_token_volumes`,
//! which is emitted solely for cache-ENABLED models — a small, validated, bounded set (the alias
//! has a tariff). The all-traffic metrics (`marker_requests`, `requests`) carry **no `model` label**,
//! because the raw request `model` is unvalidated there (unknown/typo strings) and would be
//! unbounded cardinality. Never label `model` from unvalidated input, nor by principal / api-key /
//! correlation-id / prefix-hash.
//! - Counters created lazily on first emission (no pre-registration needed, as in
//! `errors.rs`); histograms use the recorder's default buckets unless tuned in the
//! recorder builder.
use ;
// ── Adoption ──────────────────────────────────────────────────────────────────
/// Every chat-completions request the layer sees, labelled by whether the client included
/// any `cache_control` markers. Adoption % = `marked="true"` / all. Measured at the strip
/// step — before the model is validated — so it covers ALL traffic; deliberately **no
/// `model` label** (raw, unbounded user input here). Per-model lives on `record_token_volumes`.
// ── Request outcome + token volumes ───────────────────────────────────────────
/// Request-level cache behaviour across ALL traffic. `outcome` ∈ `read` | `create_only` |
/// `read_and_create` | `zero_active` (enabled but nothing cached) | `inactive` (model not
/// enabled / no key). **No `model` label** — `inactive` covers unknown/typo models (raw
/// input → unbounded); per-model volumes are on `record_token_volumes` (enabled-only).
/// Token volumes for a cache-active request, from the classifier's split. The `model`
/// label is safe here (unlike the all-traffic metrics): this is emitted only for
/// cache-ACTIVE requests, so `model` is always a tariffed/enabled alias — a bounded,
/// validated set, never raw request input. Feeds the cache-reuse rate =
/// `read / (read + creation)`. The full-prompt hit rate and the uncached residual are
/// [DB]-derived from `http_analytics` (`prompt_tokens − read − creation`), which is also
/// the authoritative billing source — this counter is the real-time *classified* volume.
// ── Classify path ─────────────────────────────────────────────────────────────
/// Classify-join result. `outcome` ∈ `ok` | `deadline_exceeded` | `error` | `panicked`.
/// `deadline_exceeded` is the primary "tokenizer/index outage is adding latency" signal.
/// Wall-time of the fork-joined classify (parallel with the upstream call).
/// Index-lookup (cache READ) latency — the `prompt_cache_entries` point-lookup on its own,
/// separate from tokenize and commit so a read-path p99 spike is attributable to the DB read
/// (or connection acquisition) rather than buried inside `classify_duration`.
/// Why a cache-enabled request cached nothing. `reason` ∈ `no_markers` | `unparseable`
/// | `tokenizer_unmapped` | `tokenize_failed` | `count_mismatch` | `below_floor`.
/// (Non-enabled models / missing keys are counted by `record_request_outcome{outcome="inactive"}`.)
// ── Tokenizer-svc ─────────────────────────────────────────────────────────────
/// `outcome` ∈ `ok` | `http_error` | `unmapped_422` | `transport_error` (timeout/connection).
/// tokenizer-svc round-trip latency.
/// alias→version moka cache. `result` ∈ `hit` | `miss`.
// ── Resolver L1 caches ────────────────────────────────────────────────────────
/// principal resolver memo. `result` ∈ `hit` (L1 hit on a known key) | `miss` (L1 miss,
/// resolved to a known key via DB) | `unknown_key` (key not found — cached `None` or a fresh
/// DB miss). L1 hit-rate ≈ `hit / (hit + miss)`; key-probe volume ≈ `unknown_key`.
/// model-config resolver (the 60s-TTL enablement cache). `result` ∈ `hit` | `miss`.
// ── Commit path (success-gated write) ─────────────────────────────────────────
/// `result` ∈ `ok` | `error` | `timeout`.
/// Write skipped by the success gate. `reason` ∈ `non_2xx` (non-billable response) |
/// `stream_aborted` (mid-stream error frame or client disconnect). A high veto rate flags
/// upstream instability / wasted classify.
/// Commit (index write/refresh) latency.
// ── Safety / anti-abuse ───────────────────────────────────────────────────────
/// A cacheable request whose body couldn't be buffered — the configured body limit was
/// exceeded OR the body stream errored — so it degraded to no-cache. (`to_bytes` doesn't
/// distinguish the two without depending on axum-internal error types; both mean "couldn't
/// read the body to classify".) No `model` label: the body isn't parsed once the read fails.
/// Marker validation rejection — now a 400 to the client (the cache layer rejects synchronously
/// before forwarding), not a silent no-cache. `reason` ∈ `too_many_breakpoints` | `invalid_ttl` |
/// `unsupported_type` | `tier_disabled` (a valid tier the platform has turned off in config) |
/// `malformed_cache_control` (a non-object `cache_control`, a missing/non-string `type`, or a non-string `ttl`).