Crate tailtriage_cli

§tailtriage-cli

tailtriage-cli loads tailtriage run artifacts and turns them into a triage report.

Install it after capture instrumentation is in place.

The binary name is:

tailtriage

§What this tool does

tailtriage-cli owns the command-line artifact-analysis contract:

  • load a captured artifact
  • validate schema compatibility
  • produce JSON or human-readable triage output
  • invoke tailtriage-analyzer on loaded artifacts and rank likely bottleneck families
  • emit evidence and next checks

The output is intended to guide the next investigation step. It does not prove root cause on its own.

§Installation

cargo install tailtriage-cli

§Minimal usage

Default text output:

tailtriage analyze tailtriage-run.json

Machine-readable JSON output:

tailtriage analyze tailtriage-run.json --format json

tailtriage analyze <run.json> --format json emits the same pretty Report JSON as tailtriage_analyzer::render_json_pretty.

The CLI artifact loader requires at least one request event in requests. This is a CLI artifact-loading rule, not a requirement that tailtriage-analyzer imposes on already-constructed in-process Run values. The CLI's input is Run artifact JSON from disk; it does not consume Report JSON as input.

§How to read the result

Read output in this order:

  1. primary_suspect.kind
  2. primary_suspect.evidence[]
  3. primary_suspect.next_checks[]

Then run one targeted check, change one thing, and re-run under comparable load.
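The reading order above can be sketched in code. The snippet below parses a hypothetical minimal report; the field names follow the representative output shape shown on this page, but the content is illustrative, not real tool output.

```python
import json

# Illustrative sketch: read the triage report in the recommended order.
# The sample below is a hypothetical minimal report following the field
# names documented on this page.
report = json.loads("""
{
  "primary_suspect": {
    "kind": "application_queue_saturation",
    "evidence": ["Queue wait at p95 consumes 98.2% of request time."],
    "next_checks": ["Inspect queue admission limits and producer burst patterns."]
  }
}
""")

suspect = report["primary_suspect"]
print("1. kind:", suspect["kind"])        # what family to investigate
for line in suspect["evidence"]:          # 2. why it was ranked first
    print("2. evidence:", line)
for check in suspect["next_checks"]:      # 3. the next targeted check
    print("3. next check:", check)
```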

§Representative output shape

{
  "request_count": 250,
  "p50_latency_us": 782227,
  "p95_latency_us": 1468239,
  "p99_latency_us": 1518551,
  "p95_queue_share_permille": 982,
  "p95_service_share_permille": 267,
  "inflight_trend": null,
  "warnings": [],
  "evidence_quality": {
    "request_count": 250,
    "queue_event_count": 250,
    "stage_event_count": 250,
    "runtime_snapshot_count": 0,
    "inflight_snapshot_count": 0,
    "requests": "present",
    "queues": "present",
    "stages": "present",
    "runtime_snapshots": "missing",
    "inflight_snapshots": "missing",
    "truncated": false,
    "dropped_requests": 0,
    "dropped_stages": 0,
    "dropped_queues": 0,
    "dropped_inflight_snapshots": 0,
    "dropped_runtime_snapshots": 0,
    "quality": "strong",
    "limitations": ["Runtime snapshots are missing, limiting executor and blocking-pressure interpretation."]
  },
  "primary_suspect": {
    "kind": "application_queue_saturation",
    "score": 90,
    "confidence": "high",
    "evidence": ["Queue wait at p95 consumes 98.2% of request time."],
    "next_checks": ["Inspect queue admission limits and producer burst patterns."],
    "confidence_notes": []
  },
  "secondary_suspects": [],
  "route_breakdowns": [],
  "temporal_segments": []
}

inflight_trend may be null when no in-flight gauges were captured.

route_breakdowns is always present in JSON output and is usually an empty array. It is populated only when at least two captured routes have enough completed requests and route-level context adds signal, such as different route-level primary suspects or a large route p95 latency spread. The global primary_suspect remains the primary full-run triage lead. Route breakdowns are supporting context only. They use route-attributed request, queue, and stage events. Runtime snapshots and in-flight gauges are global signals, so they are intentionally not attributed to individual routes. Route-level summaries do not prove per-route root cause.

temporal_segments is always present in JSON output and is usually an empty array. It is populated only when conservative within-run early/late checks detect material signal movement. The global primary_suspect remains global and unchanged by segment generation. Temporal segments are within-run hints, not proof of phase-specific root cause. Report warnings can explicitly call out large early/late p95 movement. Runtime and in-flight phase attribution uses timestamp-filtered segment windows and is limited when segment-filtered samples are sparse; when early/late windows overlap under concurrency, that timestamp-filtered runtime/in-flight attribution is approximate.

§What the report contains

A report can include:

  • request count
  • request latency percentiles (p50, p95, p99)
  • p95 queue/service share summaries
  • optional in-flight trend summary
  • report warnings from analysis/report generation (for example truncation-related)
  • structured evidence quality coverage/status summary
  • primary and secondary suspects

tailtriage analyze also prints loader/lifecycle warnings to stderr before the report. Those warnings are surfaced separately; they are not merged into the report warnings field.
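Because the two streams are separate, callers that parse the report should capture stdout and stderr independently. A minimal sketch, using a stand-in command in place of a real tailtriage analyze invocation (the warning text and report fragment are made up for illustration):

```python
import subprocess
import sys

# Stand-in for `tailtriage analyze run.json`: a Python one-liner that
# writes a warning to stderr and a report fragment to stdout.
stand_in = [
    sys.executable, "-c",
    "import sys; sys.stderr.write('warning: unfinished lifecycle\\n'); "
    "sys.stdout.write('{\"request_count\": 250}\\n')",
]
result = subprocess.run(stand_in, capture_output=True, text=True)

report_text = result.stdout    # parse this as the report
warnings_text = result.stderr  # surfaced separately, never merged into the report
```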

Each suspect includes:

  • kind
  • score
  • confidence
  • evidence[]
  • next_checks[]
  • confidence_notes[] (present and empty unless evidence-aware caps affect confidence, or explicit ambiguity applies)

§Artifact compatibility contract

The tailtriage analyze workflow expects a supported tailtriage run artifact with minimum required content.

Current contract:

  • top-level schema_version is required
  • missing schema_version is rejected
  • non-integer schema_version is rejected
  • unsupported schema_version is rejected
  • current supported schema version is 1
  • requests must contain at least one request event
  • artifacts with an empty requests array are rejected by the CLI loader

For Rust in-process usage, use tailtriage-analyzer directly (analyze_run, render_text, typed Report). The stricter non-empty requests rule applies to CLI artifact loading from disk. Loader, parse, validation, and render errors return a non-zero process exit through the CLI.
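The contract above can be sketched as a validation routine. This is a reimplementation for illustration only, not the real loader code; the error messages are assumptions.

```python
import json

# Sketch of the documented CLI compatibility checks, reimplemented
# for illustration.
SUPPORTED_SCHEMA_VERSION = 1

def validate_artifact(text: str) -> dict:
    artifact = json.loads(text)
    version = artifact.get("schema_version")
    if version is None:
        raise ValueError("missing schema_version")
    # bool is a subclass of int in Python, so reject it explicitly
    if not isinstance(version, int) or isinstance(version, bool):
        raise ValueError("non-integer schema_version")
    if version != SUPPORTED_SCHEMA_VERSION:
        raise ValueError(f"unsupported schema_version: {version}")
    if not artifact.get("requests"):
        raise ValueError("requests must contain at least one request event")
    return artifact

ok = validate_artifact('{"schema_version": 1, "requests": [{"latency_us": 1200}]}')
```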

§Important interpretation notes

  • suspects are investigation leads, not proof of root cause
  • truncation warnings mean the diagnosis is based on partial retained data
  • unfinished lifecycle warnings printed by the CLI indicate some requests were not completed cleanly
  • p95_queue_share_permille and p95_service_share_permille are independent percentile summaries and do not need to sum to 1000
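A small worked example of the permille fields, using the values from the representative output above: each share converts to a percentage independently, and the pair may sum to more or less than 1000.

```python
# The two p95 shares are independent percentile summaries, so they do
# not need to sum to 1000 permille. Values taken from the representative
# output on this page.
p95_queue_share_permille = 982
p95_service_share_permille = 267

queue_pct = p95_queue_share_permille / 10      # 98.2% of request time at p95
service_pct = p95_service_share_permille / 10  # 26.7% at p95

total_permille = p95_queue_share_permille + p95_service_share_permille
```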

§Scoring and warning behavior

Suspect ranking uses deterministic, proportional, evidence-aware scoring (0-100), not fixed suspect priority.

  • Scores rank suspects inside one report; they are not probabilities.
  • Confidence is score-derived ranking strength and may be evidence-quality capped; it is not causal certainty.
  • confidence_notes[] explain caps, including sparse samples, truncation, missing instrumentation, ambiguous top scores, and partial-vs-missing runtime snapshot limits.
  • Strong downstream tail-stage contribution can outrank weak blocking/runtime signals.
  • Strong queue pressure remains a high-confidence lead when queue share/depth evidence is dominant.

How to read before/after runs:

  • Compare p95 latency movement first.
  • Confirm primary suspect kind/rank and evidence direction.
  • Use score movement as supporting context, not a standalone pass/fail rule.
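The before/after reading order can be sketched with made-up numbers (the figures below are illustrative, not tool output):

```python
# Hypothetical before/after report fragments for one mitigation attempt.
before = {"p95_latency_us": 1468239,
          "primary_suspect": {"kind": "application_queue_saturation", "score": 90}}
after = {"p95_latency_us": 610044,
         "primary_suspect": {"kind": "application_queue_saturation", "score": 88}}

# 1. Compare p95 latency movement first.
p95_improved = after["p95_latency_us"] < before["p95_latency_us"]

# 2. Confirm whether the primary suspect kind changed.
same_kind = before["primary_suspect"]["kind"] == after["primary_suspect"]["kind"]

# 3. Score movement is supporting context only: a flat or rising score
#    does not by itself mean the mitigation failed when p95 improved.
score_delta = after["primary_suspect"]["score"] - before["primary_suspect"]["score"]
```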

Why a score can stay flat or rise after mitigation:

  • Scores are relative to the evidence mix in each capture.
  • If total latency drops but the remaining tail is still dominated by one suspect family, that suspect score can remain high or increase.
  • This does not by itself mean mitigation failed when p95 and relevant evidence improve.

warnings[] may include:

  • evidence-quality warnings (for example low request counts or missing signal families)
  • ambiguity warnings when top suspects are genuinely close after calibration
  • additive truncation warnings when capture limits drop events

§Suspect kinds

The current report surface includes these suspect kinds:

  • application_queue_saturation
  • blocking_pool_pressure
  • executor_pressure_suspected
  • downstream_stage_dominates
  • insufficient_evidence

§When the result is insufficient_evidence

Usually the next step is to add more structure to capture:

  • add queue wrappers around suspected waits
  • add stage wrappers around suspected downstream work
  • optionally add runtime sampling if runtime pressure is unclear
  • re-run under comparable load

§What this tool does not do

tailtriage-cli does not capture instrumentation data.

Use capture-side crates for that:

  • tailtriage: recommended capture-side entry point
  • tailtriage-core: direct instrumentation primitives
  • tailtriage-controller: repeated bounded windows
  • tailtriage-tokio: runtime-pressure sampling
  • tailtriage-axum: Axum request-boundary integration

tailtriage-cli is the command-line artifact loader and report emitter. For in-process Rust analysis/report APIs, use tailtriage-analyzer.

Modules§

artifact
Artifact loading and validation helpers for CLI workflows.