Crate tailtriage_cli

§tailtriage-cli

tailtriage-cli loads tailtriage run artifacts and turns them into a triage report.

Install it after capture instrumentation is in place.

The binary name is:

tailtriage

§What this tool does

tailtriage-cli owns the command-line artifact-analysis contract:

  • load a captured artifact
  • validate schema compatibility
  • produce JSON or human-readable triage output
  • invoke tailtriage-analyzer on loaded artifacts and rank likely bottleneck families
  • emit evidence and next checks

The output is intended to guide the next investigation step. It does not prove root cause on its own.

§Installation

cargo install tailtriage-cli

§Minimal usage

Default text output:

tailtriage analyze tailtriage-run.json

Machine-readable JSON output:

tailtriage analyze tailtriage-run.json --format json

tailtriage analyze <run.json> --format json emits the same pretty Report JSON as tailtriage_analyzer::render_json_pretty.

The CLI artifact loader requires at least one request event in requests. This is a CLI artifact-loading rule, not a requirement that tailtriage-analyzer imposes on already-constructed in-process Run values. The CLI's input is Run artifact JSON from disk; it does not consume Report JSON as input.

§How to read the result

Read output in this order:

  1. primary_suspect.kind
  2. primary_suspect.evidence[]
  3. primary_suspect.next_checks[]

Then run one targeted check, change one thing, and re-run under comparable load.
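The reading order above can be sketched in code. The snippet below parses a hypothetical minimal report; the field names follow the representative output shape shown on this page, but the content is illustrative, not real tool output.

```python
import json

# Illustrative sketch: read the triage report in the recommended order.
# The sample below is a hypothetical minimal report following the field
# names documented on this page.
report = json.loads("""
{
  "primary_suspect": {
    "kind": "application_queue_saturation",
    "evidence": ["Queue wait at p95 consumes 98.2% of request time."],
    "next_checks": ["Inspect queue admission limits and producer burst patterns."]
  }
}
""")

suspect = report["primary_suspect"]
print("1. kind:", suspect["kind"])        # what family to investigate
for line in suspect["evidence"]:          # 2. why it was ranked first
    print("2. evidence:", line)
for check in suspect["next_checks"]:      # 3. the next targeted check
    print("3. next check:", check)
```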

§Representative output shape

{
  "request_count": 250,
  "p50_latency_us": 782227,
  "p95_latency_us": 1468239,
  "p99_latency_us": 1518551,
  "p95_queue_share_permille": 982,
  "p95_service_share_permille": 267,
  "inflight_trend": null,
  "warnings": [],
  "evidence_quality": {
    "request_count": 250,
    "queue_event_count": 250,
    "stage_event_count": 250,
    "runtime_snapshot_count": 0,
    "inflight_snapshot_count": 0,
    "requests": "present",
    "queues": "present",
    "stages": "present",
    "runtime_snapshots": "missing",
    "inflight_snapshots": "missing",
    "truncated": false,
    "dropped_requests": 0,
    "dropped_stages": 0,
    "dropped_queues": 0,
    "dropped_inflight_snapshots": 0,
    "dropped_runtime_snapshots": 0,
    "quality": "strong",
    "limitations": ["Runtime snapshots are missing, limiting executor and blocking-pressure interpretation."]
  },
  "primary_suspect": {
    "kind": "application_queue_saturation",
    "score": 90,
    "confidence": "high",
    "evidence": ["Queue wait at p95 consumes 98.2% of request time."],
    "next_checks": ["Inspect queue admission limits and producer burst patterns."],
    "confidence_notes": []
  },
  "secondary_suspects": [],
  "route_breakdowns": [],
  "temporal_segments": []
}

inflight_trend may be null when no in-flight gauges were captured.

route_breakdowns is always present in JSON output and is usually an empty array. It is populated only when at least two captured routes have enough completed requests and route-level context adds signal, such as different route-level primary suspects or a large route p95 latency spread. The global primary_suspect remains the primary full-run triage lead. Route breakdowns are supporting context only. They use route-attributed request, queue, and stage events. Runtime snapshots and in-flight gauges are global signals, so they are intentionally not attributed to individual routes. Route-level summaries do not prove per-route root cause.

temporal_segments is always present in JSON output and is usually an empty array. It is populated only when conservative within-run early/late checks detect material signal movement. The global primary_suspect remains global and unchanged by segment generation. Temporal segments are within-run hints, not proof of phase-specific root cause. Report warnings can explicitly call out large early/late p95 movement. Runtime and in-flight phase attribution uses timestamp-filtered segment windows and is limited when segment-filtered samples are sparse; when early/late windows overlap under concurrency, that timestamp-filtered runtime/in-flight attribution is approximate.

§What the report contains

A report can include:

  • request count
  • request latency percentiles (p50, p95, p99)
  • p95 queue/service share summaries
  • optional in-flight trend summary
  • report warnings from analysis/report generation (for example truncation-related)
  • structured evidence quality coverage/status summary
  • primary and secondary suspects

tailtriage analyze also prints loader/lifecycle warnings to stderr before the report. Those warnings are surfaced separately; they are not merged into the report warnings field.
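Because the two streams are separate, callers that parse the report should capture stdout and stderr independently. A minimal sketch, using a stand-in command in place of a real tailtriage analyze invocation (the warning text and report fragment are made up for illustration):

```python
import subprocess
import sys

# Stand-in for `tailtriage analyze run.json`: a Python one-liner that
# writes a warning to stderr and a report fragment to stdout.
stand_in = [
    sys.executable, "-c",
    "import sys; sys.stderr.write('warning: unfinished lifecycle\\n'); "
    "sys.stdout.write('{\"request_count\": 250}\\n')",
]
result = subprocess.run(stand_in, capture_output=True, text=True)

report_text = result.stdout    # parse this as the report
warnings_text = result.stderr  # surfaced separately, never merged into the report
```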

Each suspect includes:

  • kind
  • score
  • confidence
  • evidence[]
  • next_checks[]
  • confidence_notes[] (present and empty unless evidence-aware caps affect confidence, or explicit ambiguity applies)

§Artifact compatibility contract

The tailtriage analyze workflow expects a supported tailtriage run artifact with minimum required content.

Current contract:

  • top-level schema_version is required
  • missing schema_version is rejected
  • non-integer schema_version is rejected
  • unsupported schema_version is rejected
  • current supported schema version is 1
  • requests must contain at least one request event
  • artifacts with an empty requests array are rejected by the CLI loader

For Rust in-process usage, use tailtriage-analyzer directly (analyze_run, render_text, typed Report). The stricter non-empty requests rule applies to CLI artifact loading from disk. Loader, parse, validation, and render errors return a non-zero process exit through the CLI.
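The contract above can be sketched as a validation routine. This is a reimplementation for illustration only, not the real loader code; the error messages are assumptions.

```python
import json

# Sketch of the documented CLI compatibility checks, reimplemented
# for illustration.
SUPPORTED_SCHEMA_VERSION = 1

def validate_artifact(text: str) -> dict:
    artifact = json.loads(text)
    version = artifact.get("schema_version")
    if version is None:
        raise ValueError("missing schema_version")
    # bool is a subclass of int in Python, so reject it explicitly
    if not isinstance(version, int) or isinstance(version, bool):
        raise ValueError("non-integer schema_version")
    if version != SUPPORTED_SCHEMA_VERSION:
        raise ValueError(f"unsupported schema_version: {version}")
    if not artifact.get("requests"):
        raise ValueError("requests must contain at least one request event")
    return artifact

ok = validate_artifact('{"schema_version": 1, "requests": [{"latency_us": 1200}]}')
```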

§Important interpretation notes

  • suspects are investigation leads, not proof of root cause
  • truncation warnings mean the diagnosis is based on partial retained data
  • unfinished lifecycle warnings printed by the CLI indicate some requests were not completed cleanly
  • p95_queue_share_permille and p95_service_share_permille are independent percentile summaries and do not need to sum to 1000
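A small worked example of the permille fields, using the values from the representative output above: each share converts to a percentage independently, and the pair may sum to more or less than 1000.

```python
# The two p95 shares are independent percentile summaries, so they do
# not need to sum to 1000 permille. Values taken from the representative
# output on this page.
p95_queue_share_permille = 982
p95_service_share_permille = 267

queue_pct = p95_queue_share_permille / 10      # 98.2% of request time at p95
service_pct = p95_service_share_permille / 10  # 26.7% at p95

total_permille = p95_queue_share_permille + p95_service_share_permille
```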

§Scoring and warning behavior

Suspect ranking uses deterministic, proportional, evidence-aware scoring (0-100), not fixed suspect priority.

  • Scores rank suspects inside one report; they are not probabilities.
  • Confidence is score-derived ranking strength and may be evidence-quality capped; it is not causal certainty.
  • confidence_notes[] explain caps, including sparse samples, truncation, missing instrumentation, ambiguous top scores, and partial-vs-missing runtime snapshot limits.
  • Strong downstream tail-stage contribution can outrank weak blocking/runtime signals.
  • Strong queue pressure remains a high-confidence lead when queue share/depth evidence is dominant.

How to read before/after runs:

  • Compare p95 latency movement first.
  • Confirm primary suspect kind/rank and evidence direction.
  • Use score movement as supporting context, not a standalone pass/fail rule.
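The before/after reading order can be sketched with made-up numbers (the figures below are illustrative, not tool output):

```python
# Hypothetical before/after report fragments for one mitigation attempt.
before = {"p95_latency_us": 1468239,
          "primary_suspect": {"kind": "application_queue_saturation", "score": 90}}
after = {"p95_latency_us": 610044,
         "primary_suspect": {"kind": "application_queue_saturation", "score": 88}}

# 1. Compare p95 latency movement first.
p95_improved = after["p95_latency_us"] < before["p95_latency_us"]

# 2. Confirm whether the primary suspect kind changed.
same_kind = before["primary_suspect"]["kind"] == after["primary_suspect"]["kind"]

# 3. Score movement is supporting context only: a flat or rising score
#    does not by itself mean the mitigation failed when p95 improved.
score_delta = after["primary_suspect"]["score"] - before["primary_suspect"]["score"]
```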

Why a score can stay flat or rise after mitigation:

  • Scores are relative to the evidence mix in each capture.
  • If total latency drops but the remaining tail is still dominated by one suspect family, that suspect score can remain high or increase.
  • This does not by itself mean mitigation failed when p95 and relevant evidence improve.

warnings[] may include:

  • evidence-quality warnings (for example low request counts or missing signal families)
  • ambiguity warnings when top suspects are genuinely close after calibration
  • additive truncation warnings when capture limits drop events

§Suspect kinds

The current report surface includes these suspect kinds:

  • application_queue_saturation
  • blocking_pool_pressure
  • executor_pressure_suspected
  • downstream_stage_dominates
  • insufficient_evidence

§When the result is insufficient_evidence

Usually the next step is to add more structure to capture:

  • add queue wrappers around suspected waits
  • add stage wrappers around suspected downstream work
  • optionally add runtime sampling if runtime pressure is unclear
  • re-run under comparable load

§What this tool does not do

tailtriage-cli does not capture instrumentation data.

Use capture-side crates for that:

  • tailtriage: recommended capture-side entry point
  • tailtriage-core: direct instrumentation primitives
  • tailtriage-controller: repeated bounded windows
  • tailtriage-tokio: runtime-pressure sampling
  • tailtriage-axum: Axum request-boundary integration

tailtriage-cli is the command-line artifact loader and report emitter. For in-process Rust analysis/report APIs, use tailtriage-analyzer.

Modules§

artifact
Artifact loading and validation helpers for CLI workflows.