aicx 0.6.6

Operator CLI + MCP server: canonical corpus first, optional semantic index second (Claude Code, Codex, Gemini)
Documentation
# Distillation (Chunking)

`aicx` distills raw timelines into chunked “agent-readable” context files.

This is the piece that makes the tool useful for:
- downstream retrieval/indexing,
- fast onboarding for new agents,
- avoiding “paste 4000 lines of history” prompts.

Implementation lives in `src/chunker.rs`.

## Chunk Model

The chunker processes `TimelineEntry` streams and produces `Chunk` items:
- token estimate uses a simple heuristic: `tokens ≈ ceil(chars / 4)`
- target size defaults to ~1500 tokens with overlap (2 messages)
- extremely long messages are UTF-8 safely truncated (4000 bytes) in the chunk text

The output text format is stable and line-oriented:

```text
[project: <project> | agent: <agent> | date: <YYYY-MM-DD>]

[signals]
...
[/signals]

[HH:MM:SS] <role>: <message>
[HH:MM:SS] <role>: <message>
...
```

The `[signals]` block is optional and only appears when the chunk contains at least one
high-signal marker (checklist items, intent lines, results, or keyword-based highlights).

## Tuning Knobs

Defaults (see `ChunkerConfig::default()` in `src/chunker.rs`):
- `target_tokens=1500`
- `min_tokens=500`
- `max_tokens=2500`
- `overlap_messages=2`

Practical guidance:
- Increase `target_tokens` if your queries need longer local context.
- Decrease `overlap_messages` if your store grows too fast.
- Keep `max_tokens` bounded to avoid “monster chunks” that get expensive to embed and hard to retrieve.

## Signals + Highlight Extraction

Chunks compute lightweight “signals” and “highlights” (see `extract_signals` and
`extract_highlights` in `src/chunker.rs`).

Signals (persisted in the chunk text as `[signals]...[/signals]`):
- TODO checklist extraction (`- [ ]`, `- [x]`) with a prominent "RED LIGHT" marker when open items exist
- intent-line extraction (example: "mam taki pomysl, zeby ...")
- result-line extraction (example: "SMOKE TEST PASSED", "0 failed")
- tag vicinity extraction for Claude-style markers:
  - `Ultrathink` (typically indicates important decisions)
  - `Insight` (important implementation discoveries)
- plan-mode session signals:
  - `Plan mode` (session start / context)
  - `User accepted the plan` / `Accept plan`
  - `Approve and bypass permissions`
- optional inclusion of keyword-based highlights as "Notes"

Highlights (programmatic only):
- extract up to 3 first lines that match keyword heuristics (e.g. `Decision:`, `TODO:`, `Plan:`)
- are stored on the in-memory `Chunk` struct as `highlights: Vec<String>`

Note: "highlights" are currently included in the `[signals]` block as "Notes" for
agent visibility, but they are also kept separately on the `Chunk` struct for tooling.

## Efficiency Notes (Where To Improve)

The current chunker is correct and tested, but there are clear performance wins for very large timelines:

1. Avoid allocating a new `Vec<&TimelineEntry>` for every chunk window.
2. Precompute per-message token estimates once per day and reuse them across overlapping windows.
3. Replace the per-date `BTreeMap` grouping with a single-pass scan of already-sorted entries.
4. Reduce per-chunk string work in hot paths (chunk IDs and text builders).

If you implement these, keep tests in `src/chunker.rs` as the behavioral contract.