asrch 0.1.1 - Docs.rs

# CLI Behavior

[日本語](cli_behavior.md)

This document defines the external behavior of `asrch`. The project goals and non-goals are defined in [goals_and_condition.en.md](goals_and_condition.en.md).

Counts are matching-line counts returned by `ripgrep`, not counts of individual match occurrences. Multiple occurrences on one line count as one match.

## Search Protocol

Agents should not combine multiple terms into a broad OR regex and then read raw matches. They should explore in this order:

1. Use `survey` to compare multiple candidate terms.
2. Choose one useful term and use `scout` to inspect its distribution.
3. Select a file and use `sample` to view the line ranges of clusters of nearby matches.
4. Use `show --line` when needed to inspect context around a specific line.

## Commands

| Command | Input | Purpose | Default Output Budget |
| --- | --- | --- | --- |
| `survey` | Multiple fixed terms and candidate paths | Compare per-term match counts, file counts, and promising paths | Structural bounds |
| `scout` | Single query | Summarize match count, file count, top directories, and top files | Structural bounds |
| `sample` | Single query and explicit file | Cluster nearby matches and page through ranges with short context | Structural bounds |
| `show` | Single query and explicit file | Show small snippets, or context around a specific line | Structural bounds |

`count` and `terms` are not provided. `count` overlaps with `scout`, and simple frequency-based `terms` output tends to recommend noisy follow-up terms when the initial search is broad.

There are no line-count or byte-count hard caps. Output is structurally bounded: `survey` accepts at most 12 terms and 8 paths, `scout` prints only top directories and files, `sample` prints at most 5 clusters per page from one explicit file, and `show` prints at most 20 matching lines with at most 5 context lines. Every output line is clipped to 800 bytes.
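The per-line clip can be illustrated with a minimal sketch. `clip_line` is a hypothetical helper, not part of `asrch`. The document does not say how a multi-byte character that straddles the 800-byte limit is handled, or whether a truncation marker is appended; the sketch assumes the partial character is dropped and no marker is added.

```python
MAX_LINE_BYTES = 800  # per-line clip stated in this document


def clip_line(line: str, limit: int = MAX_LINE_BYTES) -> str:
    """Clip a line to at most `limit` UTF-8 bytes (hypothetical helper)."""
    data = line.encode("utf-8")
    if len(data) <= limit:
        return line
    # Assumption: a character cut in half by the limit is dropped,
    # so the result is always valid UTF-8.
    return data[:limit].decode("utf-8", errors="ignore")
```

Because the limit is in bytes, a line of non-ASCII text is clipped to fewer than 800 characters.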

Of `scout`, `sample`, and `show`, only `scout` defaults its path to the current directory (`.`). `sample` and `show` require an explicit file. The search mode options (`--identifier`, `--word`, and `--regex`) are mutually exclusive.

## Multiple Terms and OR

`survey` accepts up to 12 `--term` values and up to 8 candidate paths. It searches each term/path pair independently and does not print match bodies.

The path limit counts path arguments as written on the command line. Repeating the same path counts as multiple path arguments, and its results are counted once per occurrence. The default path is `.`. If both the term limit and the path limit are exceeded, both violations are reported together.
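The limit check described above can be sketched as follows. `check_survey_limits` and its message wording are hypothetical; only the limits (12 terms, 8 paths), the default path, the counting of repeated paths, and the combined reporting come from this document.

```python
MAX_TERMS = 12
MAX_PATHS = 8


def check_survey_limits(terms, paths):
    """Return every limit violation at once (hypothetical helper)."""
    paths = list(paths) or ["."]  # the default path is the current directory
    problems = []
    if len(terms) > MAX_TERMS:
        problems.append(f"too many terms: {len(terms)} > {MAX_TERMS}")
    # Paths are counted as written, so duplicates are not collapsed.
    if len(paths) > MAX_PATHS:
        problems.append(f"too many paths: {len(paths)} > {MAX_PATHS}")
    return problems
```

Collecting violations in a list, rather than failing on the first one, is what allows both to be reported together.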

`survey` uses a compact TOON-style output. It includes both `overall[term,matches,files,dominant_path]` and `by_path`. `overall` shows the total match count, total file count, and most active path for each term. `by_path` shows the per-path term distribution and omits zero-match rows. This lets agents choose the next term and path without re-aggregating the result in reasoning.

`scout` also uses TOON-style output. Metadata is under `scout:`, and distribution rows are emitted as `top_directories[path,matches]` and `top_files[path,matches]`. Detailed rows in `survey` by-path sections and `scout` distributions are displayed relative to the searched path to avoid repeating long absolute prefixes. `sample` and `show` keep their snippet-oriented format for readability.

`scout` prints the top 5 directories and top 5 files in descending matching-line count. Ties are ordered lexicographically by path. Directory counts aggregate each matching file's matching-line count by parent directory. Detailed paths are made relative to the searched path when possible and retain their original form otherwise.
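As an illustration of the ranking rule, the following sketch orders rows by descending count with lexicographic tie-breaking and aggregates file counts by parent directory. The function names are hypothetical, and rendering the parent of a top-level file as `.` is an assumption.

```python
from collections import Counter
from pathlib import PurePosixPath


def top_rows(counts, limit=5):
    """Sort by descending count, then by path, and keep the first `limit` rows."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


def directory_counts(file_counts):
    """Sum each file's matching-line count into its parent directory."""
    totals = Counter()
    for path, matches in file_counts.items():
        totals[str(PurePosixPath(path).parent)] += matches
    return dict(totals)


file_counts = {"src/a.rs": 3, "src/b.rs": 3, "docs/x.md": 6, "README.md": 1}
top_files = top_rows(file_counts)
top_directories = top_rows(directory_counts(file_counts))
```

Here `docs` and `src` both total 6 matching lines, so `docs` is listed first because it sorts earlier lexicographically.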

TOON-like values are unquoted when they contain only ASCII alphanumeric characters and `_`, `-`, `.`, `/`, or `=`. Other values are enclosed in double quotes, with `"` and `\` escaped.
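The quoting rule can be sketched as below. `toon_value` is a hypothetical name, and quoting the empty string is an assumption because the rule above does not cover it.

```python
import string

# Characters that may appear in an unquoted value, per the rule above.
SAFE_CHARS = set(string.ascii_letters + string.digits + "_-./=")


def toon_value(value: str) -> str:
    """Render a value unquoted when safe, otherwise double-quoted and escaped."""
    if value and all(ch in SAFE_CHARS for ch in value):
        return value
    # Escape backslashes first so the backslashes added for quotes are not doubled.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'
```

For example, `src/main.rs` stays unquoted, while `my file.rs` is quoted because of the space.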

If one term accounts for at least 80% of all matching lines, `survey` warns that the term dominates the comparison. In fixed-string mode, terms of 3 characters or fewer produce a warning suggesting `--identifier` or `--word`.
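A minimal sketch of the two warning conditions follows. `survey_warnings` and its tuple output are hypothetical. It assumes that "characters" means Unicode code points and applies the 80% rule literally, so a survey with a single matching term would also be flagged; the document does not say whether the real CLI treats that case specially.

```python
def survey_warnings(match_counts, fixed_string=True):
    """Return (kind, term) warnings for a term -> matching-line-count mapping."""
    total = sum(match_counts.values())
    warnings = []
    for term, count in match_counts.items():
        # Integer comparison avoids floating-point error at exactly 80%.
        if total > 0 and count * 100 >= total * 80:
            warnings.append(("dominant", term))
        # Short-term warning applies only in plain fixed-string mode.
        if fixed_string and len(term) <= 3:
            warnings.append(("short", term))
    return warnings
```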

`scout`, `sample`, and `show` reject queries containing an unescaped `|` when `--regex` is used. The CLI does not try to fully parse arbitrary regexes and split OR expressions into search plans.
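One way to detect an unescaped `|` without parsing the regex is a single pass that tracks backslash escapes, sketched below. `has_unescaped_pipe` is hypothetical. Treating a `|` inside a character class such as `[|]` as unescaped is an assumption, in keeping with the statement that the CLI does not fully parse regexes.

```python
def has_unescaped_pipe(pattern: str) -> bool:
    """Return True if `pattern` contains a `|` not preceded by an escaping backslash."""
    escaped = False
    for ch in pattern:
        if escaped:
            escaped = False  # this character is consumed by the preceding backslash
        elif ch == "\\":
            escaped = True
        elif ch == "|":
            return True
    return False
```

Tracking the escape state, rather than checking only the previous character, correctly handles `\\|`, where the backslash itself is escaped and the `|` is an alternation.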

## Search Modes

Queries are fixed strings by default.

- No option: fixed-string substring search
- `--identifier`: ASCII identifier-boundary fixed-string search
- `--word`: word-boundary fixed-string search
- `--regex`: explicit regex search; not accepted by `survey`

Prefer `--identifier` or `--word` for short or common terms to avoid unintended partial matches.
Empty queries are rejected.
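The difference between the three fixed-string modes can be approximated with Python's `re` module. This is an illustration of the intended matching behavior, not how `asrch` or `ripgrep` implement it: `mode_pattern` is hypothetical, the identifier boundary is assumed to mean "not adjacent to `[A-Za-z0-9_]`", and Python's `\b` is used as a stand-in for the word boundary.

```python
import re

ASCII_IDENT = "A-Za-z0-9_"  # assumed set of ASCII identifier characters


def mode_pattern(term: str, mode: str = "fixed") -> re.Pattern:
    """Build a regex approximating one fixed-string search mode."""
    if not term:
        raise ValueError("empty queries are rejected")
    literal = re.escape(term)  # the term is always matched literally
    if mode == "fixed":
        return re.compile(literal)
    if mode == "identifier":
        return re.compile(f"(?<![{ASCII_IDENT}]){literal}(?![{ASCII_IDENT}])")
    if mode == "word":
        return re.compile(rf"\b{literal}\b")
    raise ValueError(f"unknown mode: {mode}")


line = "valid grid id"
substring_hits = len(mode_pattern("id").findall(line))
identifier_hits = len(mode_pattern("id", "identifier").findall(line))
```

On this line the substring search finds `id` inside `valid` and `grid` as well as the standalone word, while the identifier search finds only the standalone occurrence.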

## `sample` Selection

`sample` is deterministic; it does not use random sampling.

`sample` accepts only one explicit file path and rejects directories.

1. Treat matches within 2 lines as one cluster.
2. For each cluster, show its line range, hit count, first match, and last match.
3. Use `--clusters N` to choose 1 to 5 clusters per page. The default is 3.
4. Use `--page N` to display page N. Clusters are paged in file order.
5. Show one line of context around the first and last match in each cluster. If both are on the same line, show the snippet once.
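The clustering in step 1 can be sketched as follows. `cluster_lines` is hypothetical. The sketch assumes "within 2 lines" means that consecutive matching lines whose line numbers differ by at most 2 are chained into the same cluster.

```python
def cluster_lines(match_lines, gap=2):
    """Group matching line numbers into clusters of nearby matches."""
    groups = []
    for line in sorted(match_lines):
        if groups and line - groups[-1][-1] <= gap:
            groups[-1].append(line)  # close enough to the previous match
        else:
            groups.append([line])  # start a new cluster
    # Summarize each cluster as in step 2: range, hit count, first and last match.
    return [
        {"start": g[0], "end": g[-1], "hits": len(g), "first": g[0], "last": g[-1]}
        for g in groups
    ]


clusters = cluster_lines([3, 4, 6, 10, 30, 32, 35])
```

With these line numbers, lines 3, 4, and 6 form one cluster, line 10 stands alone, lines 30 and 32 form a third cluster, and line 35 forms a fourth because it is 3 lines away from 32.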

`sample` does not accept `--context`. Control output size with `--clusters` and `--page`.

Matches are processed in `ripgrep --sort path` order, so cluster order and page contents are deterministic for the same input. A page beyond the last page is not an error; it prints an empty-page message and the valid page range.
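Paging over clusters in file order can be sketched as below. `page_of` is hypothetical. Returning an empty list together with the last valid page number mirrors the empty-page behavior described above; treating a file with no clusters as having one empty page is an assumption.

```python
def page_of(clusters, page=1, per_page=3):
    """Return (clusters on the requested page, last valid page number)."""
    if not 1 <= per_page <= 5:
        raise ValueError("--clusters must be between 1 and 5")
    if page < 1:
        raise ValueError("--page must be at least 1")
    last_page = max(1, -(-len(clusters) // per_page))  # ceiling division
    start = (page - 1) * per_page
    # Slicing past the end yields an empty list instead of raising.
    return clusters[start:start + per_page], last_page


items = ["c1", "c2", "c3", "c4", "c5", "c6", "c7"]
third_page, last_page = page_of(items, page=3)
beyond, _ = page_of(items, page=4)
```

With seven clusters and the default of three per page, page 3 holds only the seventh cluster and page 4 is empty.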

## `show` Constraints

`show` accepts only an explicit file path and rejects directories. If there are more than 20 matching lines, or if the internal scan limit is reached, `show` refuses to print snippets and asks the agent to narrow the query.

`show --line N` prints context around a specific line instead of enumerating matching snippets. In this mode, `show` does not reject the command just because the query has more than 20 matches. Use it to inspect the middle of a cluster range reported by `sample`.

The selected line does not need to match the query. `N` must be at least 1 and no greater than the file's final line. `--context N` selects that many lines before and after the target, accepts 0 through 5, and defaults to 2.
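The validation and window arithmetic for `show --line` can be sketched as follows. `line_window` is hypothetical, and clamping the window at the start and end of the file is an assumption.

```python
def line_window(target, total_lines, context=2):
    """Return the inclusive (first, last) line range around `target`."""
    if not 0 <= context <= 5:
        raise ValueError("--context must be between 0 and 5")
    if not 1 <= target <= total_lines:
        raise ValueError("--line must be between 1 and the file's final line")
    # Assumption: the window is clamped rather than padded at file edges.
    return max(1, target - context), min(total_lines, target + context)
```

For a 10-line file, `line_window(1, 10)` covers lines 1 through 3 and `line_window(10, 10, context=5)` covers lines 5 through 10.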

`show` writes output by snippet. It does not stop midway based on a line or byte budget; output is controlled by the number of matching lines and the context size.

## Broad Searches

If a query has more than 1,000 matching lines or more than 100 matching files, the command reports that the query is broad. If the internal scan limit is reached, scanning stops and the count is reported as a lower bound rather than an exact total.

The internal scan limit is 50,000 matching lines per search. At most the first 5,000 matches are retained for snippet generation. `survey` and `scout` counts use every scanned match. `sample` builds clusters and pages only from the retained first 5,000 matches and asks the user to narrow the query or path when more matches exist. `show` without `--line` refuses searches above 20 matches, so the storage limit cannot silently omit its snippets.

When the scan limit is reached, `survey` formats counts as `>=N`; `scout`, `sample`, and `show` use `at least N`.
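The thresholds and count formats in this section can be summarized in a short sketch. The function names are hypothetical; the numeric limits and the `>=N` and `at least N` forms are taken from the text above, while any wording around them in real output is not specified here.

```python
SCAN_LIMIT = 50_000     # matching lines scanned per search
RETAIN_LIMIT = 5_000    # matches kept for snippet generation
BROAD_LINES = 1_000
BROAD_FILES = 100


def is_broad(matching_lines, matching_files):
    """A query is broad if it exceeds either threshold."""
    return matching_lines > BROAD_LINES or matching_files > BROAD_FILES


def format_count(count, limit_reached, command):
    """Format a count, marking it as a lower bound when scanning stopped early."""
    if not limit_reached:
        return str(count)
    return f">={count}" if command == "survey" else f"at least {count}"
```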

Agents should narrow the query or path instead of increasing output limits.

## Errors and Exit Status

Unknown subcommands and options print `unknown command` or `unknown option`, respectively, followed by help on standard error. Invalid input, including missing required arguments, numeric range violations, and passing a directory where an explicit file is required, exits with a nonzero status.

When `show` rejects a broad search, it also exits with a nonzero status and prints no snippets. A search with zero matches is not an error.

## Default Excludes

Noisy targets are excluded by default through `rg` globs and ignore rules:

- `.git`, `target`, `node_modules`, `vendor`
- `dist`, `build`, `coverage`, `generated`
- `scratch`, `tmp`
- `*.log`, `*.jsonl`, `*.xml`, `*.min.js`, `*.map`

There is no CLI option to disable these default excludes. The default favors safe agent exploration over easy access to noisy generated output.