# `sel` — Definitive Technical Reference
> A comprehensive reference to the design, internals, and library API of
> **`sel`** (Select Slices from Text Files). This document complements, not
> replaces, the short user-facing docs:
> [`README.md`](../README.md), [`USAGE.md`](USAGE.md),
> [`ARCHITECTURE.md`](ARCHITECTURE.md), and the rustdoc on
> [docs.rs/sel-rs](https://docs.rs/sel-rs).
- **Package**: `sel`
- **Version documented**: 0.2.0
- **Edition**: Rust 2024 (MSRV 1.92)
- **License**: MIT OR Apache-2.0
---
## Table of Contents
1. [Executive Summary](#1-executive-summary)
2. [System Overview](#2-system-overview)
3. [Design Decisions](#3-design-decisions)
4. [The Pipeline in Detail](#4-the-pipeline-in-detail)
5. [Core Data Types](#5-core-data-types)
6. [Module Reference](#6-module-reference)
7. [Selector Grammar](#7-selector-grammar)
8. [Matchers](#8-matchers)
9. [Context Expansion](#9-context-expansion)
10. [Formatters & ANSI](#10-formatters--ansi)
11. [Sources & Sinks](#11-sources--sinks)
12. [Error Model](#12-error-model)
13. [CLI Wiring](#13-cli-wiring)
14. [Library API & Embedding](#14-library-api--embedding)
15. [Extension Points](#15-extension-points)
16. [Performance Characteristics](#16-performance-characteristics)
17. [Testing Strategy](#17-testing-strategy)
18. [Build, Release & CI](#18-build-release--ci)
19. [Troubleshooting & Pitfalls](#19-troubleshooting--pitfalls)
20. [Glossary](#20-glossary)
21. [Appendix A — File Index](#appendix-a--file-index)
22. [Appendix B — Reading Paths](#appendix-b--reading-paths)
---
## 1. Executive Summary
`sel` is a small (~2 kLoC) Rust crate that provides a CLI for extracting
ranges, positions, or regex matches from text streams, plus a library that
exposes the same pipeline for embedding in other Rust tools.
It is designed around four invariants:
1. **Streaming** — one `BufReader` pass over the input; memory usage is
independent of file size. A 100 GB log and a 10 kB config travel the same
path.
2. **One pipeline** — every invocation (`sel 10-20`, `sel -e ERROR`,
`sel file.txt`) produces the same five-stage pipeline:
`Source → Matcher → Expander → Formatter → Sink`.
3. **Typestate builder** — invalid pipelines (positional selector on stdin)
fail at **compile time** inside the crate, and with a helpful message at
**parse time** from the CLI.
4. **No heavy dependencies** — three runtime crates (`clap`, `regex`,
`thiserror`); no `anyhow`, no `termcolor`, no `is-terminal`. Color is a
handful of ANSI codes in [`format/ansi.rs`](../src/format/ansi.rs).
The rest of this document walks through those ideas and the code that
realizes them.
---
## 2. System Overview
### 2.1 What `sel` does
Given a text source (a file, stdin, or multiple files), `sel` emits a subset
of the lines, optionally with surrounding context, optionally with
highlighting, to a sink (stdout or a file).
The subset is one of:
| Mode | Selector form | Example |
|---|---|---|
| **All** | bare file, no selector | `sel file.txt` |
| **Line numbers** | digits, commas, hyphens | `sel 1,5,10-20 file.txt` |
| **Positions** | `line:column` | `sel -n 10 23:260 file.txt` |
| **Regex** | `-e PATTERN` | `sel -e TODO src/*.rs` |
| **Invert regex** | `-v -e PATTERN` | `sel -v -e '^\s*#' cfg.ini` |
### 2.2 Crate topology
```
sel (crate)
├── binary  `sel`  (src/main.rs)      <─┐
│                                       │ both go through
└── library `sel`  (src/lib.rs)         │ pipeline::run
    ├── cli.rs       (argv → App)     <─┘
    ├── app.rs       (typed builder)
    ├── pipeline.rs  (driver)
    ├── source/      (file, stdin)
    ├── matcher/     (all, lines, position, regex)
    ├── context.rs   (no-context, line-context)
    ├── format/      (plain, fragment, ansi)
    ├── sink/        (stdout, file)
    ├── selector.rs  (parser)
    ├── types.rs     (Line, MatchInfo, Emit, Role)
    └── error.rs     (SelError)
```
The CLI is a very thin wrapper around the library: `main.rs` is **30 lines**
that iterate over file arguments, ask `Cli` to build an `App` for each, and
hand it to `pipeline::run`.
### 2.3 Data flow at a glance
```text
argv
  │
  ▼
Cli::parse ──► Cli::validate ──► for each file ──► Cli::into_app_for_*
                                                            │
                                                            ▼
                                                   App<K: SourceKind>
                                                            │
                                                            ▼
┌───────────────────────── pipeline::run ──────────────────────────┐
│                                                                  │
│   Source ──► Matcher ──► Expander ──► Formatter ──► Sink         │
│   (Line)   (MatchInfo)  (EmitOwned)    (bytes)     (flush)       │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
```
Each stage is a trait object. Stages communicate through a tiny value
vocabulary: [`Line`](../src/types.rs), [`MatchInfo`](../src/types.rs),
[`EmitOwned`](../src/context.rs), and [`Emit`](../src/types.rs).
---
## 3. Design Decisions
This section documents the **why**. The **what** is in subsequent chapters.
### 3.1 Why a single pipeline, not a dispatch tree?
Earlier revisions of `sel` had one control-flow path per mode (line mode,
position mode, regex mode). Adding a new flag or context rule meant editing
all paths. The 0.2 refactor collapsed them into one loop:
```rust
// src/pipeline.rs
while let Some(line) = app.source.next_line()? {
    let info = app.matcher.match_line(&line);
    app.expander.push(line, info, &mut |emit| {
        formatter.write(sink.as_mut(), &emit.borrow())
    });
}
app.expander.drain(...);
sink.finish()?;
```
One loop, one place to change iteration semantics, one place to reason about
ownership. New features are almost always a swap of one stage.
See [`src/pipeline.rs:8`](../src/pipeline.rs) for the driver.
### 3.2 Why a typestate builder?
A positional selector (`line:column`) is meaningful only on a **seekable**
source; asking stdin for "column 260 on line 23" is nonsense because a pipe
cannot be addressed. The crate could detect this at runtime, but it is also
trivially representable in the type system:
- `Seek` and `NonSeek` are marker types implementing `SourceKind`.
- `Seekable: SourceKind` is a further marker that only `Seek` implements.
- `Stage2::with_position_matcher` is defined **only for `Stage2<Seek>`**.
Concrete effect: the Rust compiler refuses to compile
`StdinSource → PositionMatcher`. The CLI mirrors this at runtime with the
`SelError::PositionalWithStdin` variant for users who hit it.
See [`src/app.rs:11`](../src/app.rs) for the markers and
[`src/app.rs:66`](../src/app.rs) for the seek-restricted stage.
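
The marker-type arrangement can be sketched in isolation. The block below is a minimal illustration and not the crate's code: the marker names follow the text, while `Stage2::new`, the `matcher` field, and the string-valued matcher are hypothetical stand-ins chosen so the example fits in one file.

```rust
use std::marker::PhantomData;

/// Marker types for the source kind (names follow the text above).
pub struct Seek;
pub struct NonSeek;

pub trait SourceKind {}
impl SourceKind for Seek {}
impl SourceKind for NonSeek {}

/// Further marker that only `Seek` implements.
pub trait Seekable: SourceKind {}
impl Seekable for Seek {}

/// Hypothetical stand-in for the builder stage that already holds a source.
pub struct Stage2<K: SourceKind> {
    matcher: Option<&'static str>,
    _kind: PhantomData<K>,
}

impl<K: SourceKind> Stage2<K> {
    pub fn new() -> Self {
        Stage2 { matcher: None, _kind: PhantomData }
    }

    /// Available for every source kind.
    pub fn with_matcher(mut self, name: &'static str) -> Self {
        self.matcher = Some(name);
        self
    }

    pub fn matcher(&self) -> Option<&'static str> {
        self.matcher
    }
}

impl<K: Seekable> Stage2<K> {
    /// Exists only when the source kind is seekable.
    pub fn with_position_matcher(mut self) -> Self {
        self.matcher = Some("position");
        self
    }
}
```

With these definitions, `Stage2::<NonSeek>::new().with_position_matcher()` does not type-check, because `NonSeek` does not implement `Seekable`.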
### 3.3 Why trait objects (`Box<dyn Trait>`)?
Stages are swappable at runtime based on flags. Static dispatch would
explode into a combinatorial number of monomorphizations (matcher × expander
× formatter × sink). The streaming pipeline is I/O-bound, so vtable overhead
is invisible. Trait objects give us:
- ~30 lines of pipeline driver code instead of generic gymnastics,
- simple CLI wiring in `cli.rs`,
- easy third-party extensibility (implement a trait, pass a `Box<dyn …>`).
### 3.4 Why strip dependencies?
`anyhow`, `termcolor`, and `is-terminal` were dropped between 0.1 and 0.2.
- **`anyhow`** — error domain is small and well-enumerated; `thiserror`
gives structured errors that tests can match on.
- **`termcolor`** — `sel` emits three ANSI codes (green, inverse, reset).
A 14-line module (`format/ansi.rs`) replaces the dependency.
- **`is-terminal`** — stabilised in `std::io::IsTerminal`; no need for the
crate.
The result: faster builds, less surface area to vet, zero behavioural loss.
### 3.5 Why normalize selectors?
Users can (and do) write `sel 1,2,3,10-20,15,16 file.txt`. Left raw, each
line would be checked against six specs. The `Selector::normalize()` step
merges them into `[1-3, 10-20]`:
- Two ranges instead of six specs → fewer comparisons per line.
- Predictable behaviour when specs overlap.
- Stable ordering for tests and snapshot comparisons.
`LineMatcher::from_selector` always runs `normalize()` first
([`src/matcher/lines.rs:15`](../src/matcher/lines.rs)).
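
The merge itself is a standard sort-and-fold over inclusive ranges. The sketch below assumes ranges are plain `(start, end)` tuples and uses a hypothetical `merge_ranges` function; the crate's own types and method may look different.

```rust
/// Merge inclusive line ranges: sort them, then fold each range into the
/// previous one when the two overlap or are directly adjacent.
pub fn merge_ranges(mut ranges: Vec<(u64, u64)>) -> Vec<(u64, u64)> {
    ranges.sort_unstable();
    let mut merged: Vec<(u64, u64)> = Vec::with_capacity(ranges.len());
    for (start, end) in ranges {
        match merged.last_mut() {
            // `start <= last.1 + 1` covers both overlap and adjacency.
            Some(last) if start <= last.1.saturating_add(1) => {
                last.1 = last.1.max(end);
            }
            _ => merged.push((start, end)),
        }
    }
    merged
}
```

Feeding it the six specs from `1,2,3,10-20,15,16` (singles written as `(n, n)`) yields the two ranges `(1, 3)` and `(10, 20)`.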
### 3.6 Why bytes, not `&str`, for lines?
Input files are not guaranteed to be valid UTF-8. `Source` yields
`Line { no, bytes: Vec<u8> }`. The regex crate's `bytes::Regex` matches
against `&[u8]` directly; formatting uses `String::from_utf8_lossy` to
surface malformed UTF-8 as U+FFFD without crashing.
This matters for grepping binary-ish logs, Windows UTF-16 accidentally
piped in, or files with shift-JIS regions.
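
As a small illustration of the lossy conversion step, the helper below (a hypothetical `display_line`, not a function from the crate) shows how invalid bytes surface as U+FFFD while valid input is borrowed without copying.

```rust
use std::borrow::Cow;

/// Convert raw line bytes to displayable text. Valid UTF-8 is borrowed
/// as-is; each invalid sequence is replaced with U+FFFD.
pub fn display_line(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}
```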
### 3.7 Why `Role` (Target vs Context)?
The formatter needs to know whether to prepend `"> "` (the green target
marker) and whether `match_info.spans` are meaningful: a context line is
emitted because it falls inside a hit's window, so it has no spans of its
own to highlight. `Role::Target | Role::Context` keeps that information on
the `Emit` itself so formatters stay stateless.
---
## 4. The Pipeline in Detail
### 4.1 Stage contracts
| Stage | Trait | Input | Output |
|---|---|---|---|
| Source | `Source` | — | `Option<Line>` |
| Matcher | `Matcher` | `&Line` | `MatchInfo` |
| Expander | `Expander` | `Line, MatchInfo` | zero or more `EmitOwned` |
| Formatter | `Formatter` | `&Emit` | bytes to sink |
| Sink | `Sink: Write` | bytes | flushed I/O |
Ownership moves forward down the stages:
- The source owns the I/O handle.
- `Line::bytes` is a fresh `Vec<u8>` per call (no shared buffer — a
deliberate simplification that keeps the expander's buffering trivial).
- The expander owns the `Line` until it hands it to the formatter inside
an `EmitOwned`.
- The formatter borrows the emit and writes bytes to a `Sink`.
### 4.2 Walkthrough: `sel -c 2 -e ERROR file.log`
1. **CLI**: `Cli::parse` captures `regex = "ERROR"`, `context = Some(2)`;
`get_files()` returns `["file.log"]`.
2. **Build**: `into_app_for_file` returns
`App<Seek> { source: FileSource, matcher: RegexMatcher, expander: LineContext(2), formatter: PlainFormatter, sink: StdoutSink }`.
3. **Run**: `pipeline::run` loops. For each line:
- `FileSource::next_line` yields `Line { no, bytes }`.
- `RegexMatcher::match_line` returns `MatchInfo { hit, spans, col: None }`.
- `LineContext::push` either:
- Buffers the line (miss, no active trailing window).
- Flushes the "before" buffer and emits the hit + starts a trailing
window of 2.
- Emits the line as context if we're within a trailing window.
4. **Output**: `PlainFormatter::write` prints `[prefix]content\n`, with
green `"> "` on target lines because `target_marker` is on.
5. **Finish**: `sink.finish()` flushes the `BufWriter`.
### 4.3 The `Emit` / `EmitOwned` split
`EmitOwned` carries a `Line` by value through the expander's callback so
the expander can hold onto (or discard) buffered lines. The formatter
receives a borrowed `Emit<'_>` because it only needs to read. Conversion is
cheap: `EmitOwned::borrow` constructs the borrowed view in place
([`src/context.rs:27`](../src/context.rs)).
### 4.4 Back-pressure & streaming
`sel` has no explicit back-pressure; the `BufWriter` on the sink and the
OS write() call provide flow control naturally. Memory high-water mark is:
- `Line::bytes` currently in flight (one line).
- `before` ring buffer of at most `n` lines in `LineContext` where `n` is
the `-c` value.
- `spans: Vec<Range<usize>>` for regex hits on the current line.
For a 100 GB file with `-c 3` and no pathologically long lines, peak memory
stays in the tens of kilobytes.
---
## 5. Core Data Types
### 5.1 `Line`
```rust
// src/types.rs:7
pub struct Line {
    pub no: u64,        // 1-indexed line number
    pub bytes: Vec<u8>, // line contents, newline stripped (both \n and \r\n)
}
```
Produced by sources, consumed by matchers (by reference), and carried
through the expander until emitted. Newline stripping is the source's
responsibility; both `\n` and `\r\n` are handled
([`src/source/file.rs:51`](../src/source/file.rs)).
### 5.2 `MatchInfo`
```rust
// src/types.rs:24
pub struct MatchInfo {
    pub hit: bool,
    pub spans: Vec<Range<usize>>, // byte ranges for regex highlighting
    pub col: Option<usize>,       // 1-indexed target column (positions)
}
```
`MatchInfo::default()` is a miss. Populated by each matcher according to
its nature: regex sets `spans`, position sets `col`, line matchers set
neither.
### 5.3 `Role`
```rust
// src/types.rs:36
pub enum Role { Target, Context }
```
Whether this emit is the line that matched (`Target`) or a context
neighbour (`Context`). Formatters consult this to decide on markers and
span highlighting.
### 5.4 `Emit<'a>` and `EmitOwned`
```rust
// src/types.rs:45
pub struct Emit<'a> {
    pub line: &'a Line,
    pub role: Role,
    pub match_info: &'a MatchInfo,
}

// src/context.rs:19
pub struct EmitOwned {
    pub line: Line,
    pub role: Role,
    pub match_info: MatchInfo,
}
```
`EmitOwned` is what flows through the expander's callback; `Emit` is what
formatters see.
---
## 6. Module Reference
Concise role summary for every public module. Source files are at
`src/<path>` in the repo.
| Module | Role | Key items |
|---|---|---|
| `lib.rs` | Crate root, re-exports | `App`, `Selector`, `Result`, `run` |
| `main.rs` | Binary entry point | — |
| `cli.rs` | `clap` definitions + CLI → App wiring | `Cli`, `ColorMode` |
| `app.rs` | Typestate builder | `App<K>`, `Stage1..5`, `Seek`, `NonSeek` |
| `pipeline.rs` | Single driver `run::<K>(App<K>)` | `run` |
| `selector.rs` | Selector parser & normalizer | `Selector`, `LineSpec`, `Position` |
| `context.rs` | Emit planner with optional context | `Expander`, `NoContext`, `LineContext` |
| `types.rs` | Shared value types | `Line`, `MatchInfo`, `Role`, `Emit` |
| `error.rs` | Typed errors | `SelError`, `Result` |
| `matcher/mod.rs` | `Matcher` trait + `AllMatcher` | `Matcher`, `AllMatcher` |
| `matcher/lines.rs` | Line-range matcher | `LineMatcher` |
| `matcher/position.rs` | `L:C` matcher (seekable only) | `PositionMatcher` |
| `matcher/regex.rs` | `regex::bytes::Regex` wrapper | `RegexMatcher` |
| `source/mod.rs` | `Source` trait | `Source` |
| `source/file.rs` | File-backed source | `FileSource` |
| `source/stdin.rs` | Stdin-backed source | `StdinSource` |
| `format/mod.rs` | `Formatter` trait + options | `Formatter`, `FormatOpts` |
| `format/plain.rs` | Line-oriented formatter | `PlainFormatter` |
| `format/fragment.rs` | Char-context fragment formatter | `FragmentFormatter` |
| `format/ansi.rs` | Three ANSI codes + `paint()` | `GREEN`, `INVERSE`, `RESET` |
| `sink/mod.rs` | `Sink` trait | `Sink` |
| `sink/stdout.rs` | Buffered stdout sink | `StdoutSink` |
| `sink/file.rs` | Create-new/force file sink | `FileSink` |
---
## 7. Selector Grammar
### 7.1 Informal grammar
```
line_list ::= line_spec ("," line_spec)*
line_spec ::= number | number "-" number
pos_list ::= position ("," position)*
position ::= number ":" number
number ::= [1-9][0-9]* # must be >= 1
```
Mixing line specs and positions in the same selector is **rejected**; the
parser picks one variant based on whether the string contains `:`.
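
To make the `line_list` production concrete, here is a minimal sketch of a parser for it. The `Spec` enum, the function names, and the `String` error type are assumptions for illustration; per §14.2 the crate itself reports `SelError::InvalidSelector`.

```rust
/// Hypothetical parsed form of one `line_spec`.
#[derive(Debug, PartialEq)]
pub enum Spec {
    Single(u64),
    Range(u64, u64),
}

/// number ::= [1-9][0-9]*  (no sign, no leading zero, not empty)
fn parse_number(s: &str) -> Result<u64, String> {
    if s.is_empty() || s.starts_with('0') || !s.bytes().all(|b| b.is_ascii_digit()) {
        return Err(format!("invalid number: '{s}'"));
    }
    s.parse::<u64>().map_err(|e| format!("invalid number '{s}': {e}"))
}

/// line_list ::= line_spec ("," line_spec)*
pub fn parse_line_list(input: &str) -> Result<Vec<Spec>, String> {
    input
        .split(',')
        .map(|part| -> Result<Spec, String> {
            match part.split_once('-') {
                Some((a, b)) => {
                    let (start, end) = (parse_number(a)?, parse_number(b)?);
                    if start > end {
                        return Err(format!("reversed range: '{part}'"));
                    }
                    Ok(Spec::Range(start, end))
                }
                None => parse_number(part).map(Spec::Single),
            }
        })
        .collect()
}
```

Collecting into `Result<Vec<_>, _>` stops at the first invalid item, so one bad spec rejects the whole selector.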
### 7.2 Semantics
- **Ranges are inclusive** (`10-20` selects lines 10..=20).
- **Line numbers are 1-indexed**. `0` is rejected.
- **Column numbers are 1-indexed** and measured in **bytes** (important for
multi-byte UTF-8).
- **Reversed ranges** (`20-10`) are rejected.
- **Duplicates** are removed; adjacent/overlapping ranges are merged via
`Selector::normalize()`.
### 7.3 What is "selector-ish"?
`Cli::looks_like_selector` (at [`src/cli.rs:140`](../src/cli.rs)) defines a
conservative test so the first positional argument can be either a selector
or a filename without ambiguity:
- Must contain at least one digit.
- Must contain only digits, commas, colons, hyphens.
- Colons must separate non-empty number pairs.
- `-` alone is the stdin sentinel, not a selector.
Filenames like `file.txt`, `10-20.log`, or `23:notes` are **not** matched
because they contain letters or dots.
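
The four rules above translate into a short predicate. This is a sketch written from the listed rules only, not a copy of `Cli::looks_like_selector`; the helper `is_digits` is a name introduced here.

```rust
/// True when `s` is a non-empty run of ASCII digits.
fn is_digits(s: &str) -> bool {
    !s.is_empty() && s.bytes().all(|b| b.is_ascii_digit())
}

/// Conservative "is this argument a selector?" test.
pub fn looks_like_selector(arg: &str) -> bool {
    // `-` alone is the stdin sentinel.
    if arg == "-" {
        return false;
    }
    let only_allowed = arg
        .bytes()
        .all(|b| b.is_ascii_digit() || matches!(b, b',' | b':' | b'-'));
    let has_digit = arg.bytes().any(|b| b.is_ascii_digit());
    if !only_allowed || !has_digit {
        return false;
    }
    // Any comma-separated item containing ':' must be `digits:digits`.
    arg.split(',').all(|item| match item.split_once(':') {
        Some((line, col)) => is_digits(line) && is_digits(col),
        None => true,
    })
}
```

The predicate only decides whether to hand the string to the selector parser; arguments such as `20-10` pass this test and are rejected later by parsing.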
### 7.4 Example normalizations
| Input | Normalized |
|---|---|
| `1,5,10-15,14` | `1, 5, 10-15` |
| `1,2,3` | `1-3` |
| `1-5,6-10` | `1-10` (adjacent merge) |
| `1-5,3-10` | `1-10` (overlap merge) |
| `5,5,5` | `5` |
| `1-5,10-15` | `1-5, 10-15` (no merge) |
Positions are de-duplicated and sorted, but not merged (each position is
point-valued).
---
## 8. Matchers
Every matcher implements:
```rust
pub trait Matcher {
    fn match_line(&mut self, line: &Line) -> MatchInfo;
}
```
The `&mut self` permits stateful matchers like `PositionMatcher` (which
keeps a cursor into its sorted `positions` list).
### 8.1 `AllMatcher`
Trivial: every line is a hit. Used when the user runs `sel file.txt` with
no selector and no regex.
### 8.2 `LineMatcher`
Owns a `Vec<(u64, u64)>` of merged inclusive ranges. Each line is checked
with a linear scan; for realistic selectors (one to tens of ranges after
normalization), this is faster than a `BTreeSet` or binary search because of
cache locality and branch prediction. See the benchmark
[`benches/large_file.rs:77`](../benches/large_file.rs) for the comparison.
### 8.3 `PositionMatcher`
Carries a sorted, deduped `Vec<Position>` and a `cursor` index. On each
line it advances the cursor past stale entries (positions with `line < current`),
then peeks at `positions[cursor]` to see if it matches the current line.
This is effectively a two-pointer merge of input line numbers (monotonic)
and target positions (also monotonic after `sort`), giving **O(n + p)**
where `n` is the number of input lines and `p` is the number of positions.
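
The cursor technique can be sketched as follows. `PositionCursor` and its fields are illustrative names, the function returns only the column instead of a full `MatchInfo`, and the sketch reports just the first position on a line; it also assumes line numbers arrive in increasing order, as they do from a streaming source.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Position {
    pub line: u64,
    pub col: usize,
}

/// Two-pointer matcher: one pointer is the incoming line number,
/// the other is `cursor` into the sorted position list.
pub struct PositionCursor {
    positions: Vec<Position>,
    cursor: usize,
}

impl PositionCursor {
    pub fn new(mut positions: Vec<Position>) -> Self {
        positions.sort_unstable(); // by line, then column
        positions.dedup();
        Self { positions, cursor: 0 }
    }

    /// Returns the target column if `line_no` has a position.
    pub fn match_line(&mut self, line_no: u64) -> Option<usize> {
        // Skip positions that belong to lines already passed.
        while self
            .positions
            .get(self.cursor)
            .is_some_and(|p| p.line < line_no)
        {
            self.cursor += 1;
        }
        match self.positions.get(self.cursor) {
            Some(p) if p.line == line_no => Some(p.col),
            _ => None,
        }
    }
}
```

Because the cursor only moves forward, the total work over a whole file is bounded by the number of lines plus the number of positions.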
### 8.4 `RegexMatcher`
Wraps `regex::bytes::Regex` so it can match UTF-8-invalid bytes. When the
match is a hit (and not inverted), it collects all `Match` spans into
`MatchInfo.spans` for the formatter to highlight. Inverted matches have no
spans (there is nothing to highlight).
The regex dialect is the standard Rust `regex` crate syntax — not PCRE, but
similar; lookaround and backreferences are unsupported by design (linear
time guarantee).
---
## 9. Context Expansion
### 9.1 Two expanders
- **`NoContext`** — passthrough. Emits only hits with `Role::Target`.
- **`LineContext(n)`** — for each hit, emits up to `n` preceding "before"
lines, the hit itself, then up to `n` trailing lines. Windows are merged
automatically: if two hits are at most `2n` lines apart, their context
regions overlap (at exactly `2n+1` apart they are adjacent), and the
expander deduplicates by line number.
### 9.2 Why a ring buffer?
`LineContext` maintains a `VecDeque<(Line, MatchInfo)>` of the last `n`
lines seen. On a hit, it flushes the deque as `Role::Context` then emits
the hit, then switches to "trailing" mode counting down `n` lines that
will be emitted directly as `Context`. This avoids any seek or two-pass
I/O, which is crucial for stdin support.
### 9.3 Overlap handling
`LineContext` tracks `last_emitted: u64` (highest line number emitted so
far) and refuses to re-emit any line number at or below it. This makes
overlap-merging free: two hits `h1 < h2` with `h2 - h1 <= 2n + 1` simply
produce one continuous band `[h1 - n, h2 + n]` without duplicates.
Unit tests in [`src/context.rs:118`](../src/context.rs) exercise these
cases (no-dup overlap, symmetric around a hit, drain behaviour).
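
A stripped-down version of the expander shows how the ring buffer, the trailing counter, and `last_emitted` fit together. This is a sketch under simplifying assumptions: it tracks line numbers only (the crate is described as buffering `(Line, MatchInfo)` pairs), and the callback signature is invented for the example.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    Target,
    Context,
}

/// Simplified context expander working on line numbers only.
pub struct LineContext {
    n: usize,
    before: VecDeque<u64>, // last `n` unemitted misses
    trailing: usize,       // context lines still owed after a hit
    last_emitted: u64,     // highest line number emitted so far
}

impl LineContext {
    pub fn new(n: usize) -> Self {
        Self { n, before: VecDeque::with_capacity(n), trailing: 0, last_emitted: 0 }
    }

    pub fn push(&mut self, line_no: u64, hit: bool, emit: &mut dyn FnMut(u64, Role)) {
        if hit {
            // Flush the "before" buffer, never re-emitting an old line.
            for no in self.before.drain(..) {
                if no > self.last_emitted {
                    emit(no, Role::Context);
                    self.last_emitted = no;
                }
            }
            emit(line_no, Role::Target);
            self.last_emitted = line_no;
            self.trailing = self.n;
        } else if self.trailing > 0 {
            // Inside a trailing window: emit directly as context.
            emit(line_no, Role::Context);
            self.last_emitted = line_no;
            self.trailing -= 1;
        } else if self.n > 0 {
            // Plain miss: remember it, evicting the oldest if full.
            if self.before.len() == self.n {
                self.before.pop_front();
            }
            self.before.push_back(line_no);
        }
    }
}
```

With `n = 1` and hits on lines 4 and 6 of an eight-line input, the callback sees lines 3 through 7 exactly once each, with 4 and 6 marked `Target`.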
### 9.4 EOF and `drain`
On EOF, the driver calls `expander.drain(callback)`. For `NoContext` it is
a no-op. For `LineContext`, trailing lines have already been emitted as
they arrived, and any still-buffered "before" lines (those that never
turned into context for a hit) are dropped — they do not belong in the
output.
---
## 10. Formatters & ANSI
### 10.1 `FormatOpts`
```rust
// src/format/mod.rs:20
pub struct FormatOpts {
    pub show_line_numbers: bool,
    pub show_filename: bool,
    pub filename: Option<String>,
    pub color: bool,
    pub target_marker: bool, // "> " on Role::Target (context-aware output)
}
```
`FormatOpts::prefix(line_no)` produces `filename:line:` in shared style.
### 10.2 `PlainFormatter`
Writes one line per emit: `[marker][prefix]content\n`.
- `marker` is `"> "` painted green when `target_marker && role == Target`.
- `content` is the line bytes; if color is on and there are `spans`, each
span is wrapped in `INVERSE…RESET`.
See [`src/format/plain.rs`](../src/format/plain.rs); unit tests at the
bottom cover prefix, marker, and span painting.
### 10.3 `FragmentFormatter`
Active when `-n N` is present (positional selectors, or regex with char
context). Emits a byte-window `bytes[col-N .. col+N]` plus a caret line
aligned under the target column:
```
4:cdefg
^
```
If color is on and the current hit has regex spans, the span within the
fragment is highlighted. See [`src/format/fragment.rs`](../src/format/fragment.rs).
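
The window arithmetic is easy to get wrong at the line edges, so a sketch may help. The functions below are hypothetical (`fragment` and `render` are not names from the crate) and assume a 1-indexed byte column with up to `n` bytes kept on each side; caret alignment is computed in bytes, so it is only exact for ASCII fragments.

```rust
/// Returns the byte window around 1-indexed `col` and the offset of the
/// target byte inside that window, or `None` if `col` is out of range.
pub fn fragment(bytes: &[u8], col: usize, n: usize) -> Option<(&[u8], usize)> {
    let idx = col.checked_sub(1)?; // 1-indexed to 0-indexed
    if idx >= bytes.len() {
        return None;
    }
    let start = idx.saturating_sub(n);
    let end = idx.saturating_add(n).saturating_add(1).min(bytes.len());
    Some((&bytes[start..end], idx - start))
}

/// Renders `prefix` + fragment, then a caret line under the target byte.
pub fn render(prefix: &str, bytes: &[u8], col: usize, n: usize) -> Option<String> {
    let (frag, caret) = fragment(bytes, col, n)?;
    let text = String::from_utf8_lossy(frag);
    let pad = " ".repeat(prefix.len() + caret);
    Some(format!("{prefix}{text}\n{pad}^\n"))
}
```

For the line `abcdefghij` with prefix `4:`, column 5 and `n = 2`, this produces the fragment `cdefg` with the caret placed under `e`.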
### 10.4 ANSI module
```rust
pub const GREEN: &str = "\x1b[32m";
pub const INVERSE: &str = "\x1b[7m";
pub const RESET: &str = "\x1b[0m";
pub fn paint(enabled: bool, code: &str, text: &str) -> String;
```
That is the entire color implementation. The 14-line file in
[`src/format/ansi.rs`](../src/format/ansi.rs) replaces what used to be a
dependency on `termcolor`.
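
Only the signature of `paint` is shown above, so the body below is an assumption: a plausible implementation that wraps the text in the given code plus `RESET` when color is enabled and returns it unchanged otherwise.

```rust
pub const GREEN: &str = "\x1b[32m";
pub const INVERSE: &str = "\x1b[7m";
pub const RESET: &str = "\x1b[0m";

/// Assumed body: wrap `text` in `code` ... `RESET` only when enabled.
pub fn paint(enabled: bool, code: &str, text: &str) -> String {
    if enabled {
        format!("{code}{text}{RESET}")
    } else {
        text.to_owned()
    }
}
```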
### 10.5 Color detection
`Cli::color_mode()` handles user intent (`--color always|never|auto`).
Terminal detection uses `std::io::IsTerminal` (stable since Rust 1.70).
`cli.rs` passes a resolved boolean `color` into `FormatOpts`, so the
formatter does not re-detect.
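
Resolving the user's intent against terminal detection might look like the sketch below. The `ColorMode` variants are inferred from the `--color always|never|auto` values, and `resolve_color` as a free function is an assumption about shape, not the crate's signature.

```rust
use std::io::IsTerminal;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ColorMode {
    Always,
    Never,
    Auto,
}

/// Collapse user intent and terminal detection into one boolean.
pub fn resolve_color(mode: ColorMode, sink_is_terminal: bool) -> bool {
    match mode {
        ColorMode::Always => true,
        ColorMode::Never => false,
        ColorMode::Auto => sink_is_terminal,
    }
}

/// Standard-library terminal check for stdout.
pub fn stdout_is_terminal() -> bool {
    std::io::stdout().is_terminal()
}
```

Passing the detection result in as a parameter keeps the decision testable without a real terminal.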
---
## 11. Sources & Sinks
### 11.1 `Source` trait
```rust
// src/source/mod.rs:15
pub trait Source {
    fn next_line(&mut self) -> Result<Option<Line>>;
    fn label(&self) -> &str;        // for filename prefix
    fn is_seekable(&self) -> bool;  // informational mirror of SourceKind
}
```
Both implementations (`FileSource`, `StdinSource`) use a `BufReader` and
`read_until(b'\n', …)` so lines are discovered without requiring UTF-8
validity. Trailing `\r\n` is normalized to Unix-style.
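
A minimal sketch of the line-reading loop, assuming a free function over any `BufRead` and a caller-held line counter (both choices are for illustration; the crate keeps this state inside its source types):

```rust
use std::io::{self, BufRead};

pub struct Line {
    pub no: u64,
    pub bytes: Vec<u8>,
}

/// Read one line as raw bytes, stripping a trailing `\n` or `\r\n`.
/// Returns `Ok(None)` at end of input.
pub fn next_line<R: BufRead>(reader: &mut R, last_no: &mut u64) -> io::Result<Option<Line>> {
    let mut buf = Vec::new();
    if reader.read_until(b'\n', &mut buf)? == 0 {
        return Ok(None);
    }
    if buf.last() == Some(&b'\n') {
        buf.pop();
        if buf.last() == Some(&b'\r') {
            buf.pop();
        }
    }
    *last_no += 1;
    Ok(Some(Line { no: *last_no, bytes: buf }))
}
```

A final line without a terminating newline is still returned, and bytes that are not valid UTF-8 pass through untouched.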
### 11.2 `FileSource`
Owns a `BufReader<File>` plus the `PathBuf`. I/O errors are wrapped as
`SelError::Io { path, source }` so the offending file shows up in the
error message. `is_seekable()` returns `true` — the crate's typestate
ensures `Seek` only attaches to this variant.
### 11.3 `StdinSource`
Locks stdin by `Box::leak`ing the `Stdin` handle for the lifetime of the
process and calling `.lock()`, producing a `StdinLock<'static>`. This lets
the `BufReader` own the lock without fighting lifetime parameters. The
trade-off is a single intentional leak of one small handle, which is fine
because there's exactly one stdin per process and it lasts as long as the
pipeline.
`is_seekable()` returns `false`; positional selectors on stdin are
rejected at two layers (compile-time `NonSeek` type and runtime
`SelError::PositionalWithStdin`).
### 11.4 `Sink` trait
```rust
// src/sink/mod.rs:12
pub trait Sink: Write {
    fn is_terminal(&self) -> bool;
    fn finish(self: Box<Self>) -> io::Result<()>; // flush + surface errors
}
```
The `finish` takes `Box<Self>` so the pipeline can consume the sink by
value at the end of `run()`.
### 11.5 `StdoutSink`
Locks stdout with the same leak-lock trick as `StdinSource` and wraps it
in a `BufWriter` (64 KiB). `is_terminal()` is cached at construction time.
### 11.6 `FileSink`
Opens the output path with either `create_new(true)` (default) or
`create(true).truncate(true)` (when `force`). An existing file without
`--force` yields `SelError::OutputExists(path)` which in turn becomes:
```
Error: output file already exists: out.txt (use --force to overwrite)
```
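
The two open modes map directly onto `std::fs::OpenOptions`. The helper below is a hypothetical `open_output`, shown without the crate's error wrapping: an existing file without `force` surfaces as `io::ErrorKind::AlreadyExists`, which is the condition a caller would translate into `SelError::OutputExists`.

```rust
use std::fs::{File, OpenOptions};
use std::io;
use std::path::Path;

/// Open the output file, refusing to clobber unless `force` is set.
pub fn open_output(path: &Path, force: bool) -> io::Result<File> {
    let mut opts = OpenOptions::new();
    opts.write(true);
    if force {
        // Overwrite: create if missing, truncate if present.
        opts.create(true).truncate(true);
    } else {
        // Fails with ErrorKind::AlreadyExists if the path exists.
        opts.create_new(true);
    }
    opts.open(path)
}
```

`create_new` performs the existence check and the creation as one atomic step, which avoids a check-then-create race.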
### 11.7 Interaction summary
| Invocation | Source | Sink | Notes |
|---|---|---|---|
| file arg | `FileSource` | `StdoutSink` | default |
| `-o out` | `FileSource` | `FileSink` | fails if `out` exists |
| `-o out --force` | `FileSource` | `FileSink` | truncates `out` |
| `-o -` | `FileSource` | `StdoutSink` | explicit stdout |
| no file | `StdinSource` | `StdoutSink` | classic pipe |
| `-` as file | `StdinSource` | `StdoutSink` | same as above |
---
## 12. Error Model
### 12.1 `SelError` variants
```rust
// src/error.rs
pub enum SelError {
    InvalidSelector(String),
    InvalidRegex(String),
    PositionalWithStdin,
    InvertWithoutRegex,
    CharContextWithoutTarget,
    Io { path: String, source: io::Error },
    OutputExists(PathBuf),
}
```
All variants produce human-readable messages via `thiserror`'s `#[error(...)]`.
Every `Io` error carries the offending path, so a bare
`no such file or directory` becomes
`logs/missing.log: no such file or directory`.
### 12.2 CLI exit behaviour
`main` prints `Error: {e}` to stderr and exits with status 1 on any
`SelError`. `Cli::validate` catches CLI-level conflicts early (before
building a pipeline) so the user sees a crisp message for
`--invert-match` without `-e`, or `--char-context` without a target.
### 12.3 Library users
Library users receive `sel::Result<T>` and can `match` on `SelError`:
```rust
match sel::pipeline::run(app) {
    Err(sel::SelError::OutputExists(p)) => { /* prompt the user */ }
    Err(e) => { /* generic fallback */ }
    Ok(()) => {}
}
```
---
## 13. CLI Wiring
### 13.1 `Cli` struct
All flags are declared in [`src/cli.rs:24`](../src/cli.rs) and derived via
`clap`'s `#[derive(Parser)]`; there is no hand-written `clap::Command`
builder boilerplate.
Notable methods:
- `get_selector()` — returns `Some(raw)` if `args[0]` looks like a selector
and we are not in regex mode.
- `get_files()` — returns the file list (falls back to `["-"]` if none).
- `validate()` — checks flag conflicts early.
- `color_mode()` — resolves `--color` against TTY detection.
- `into_app_for_file(path, show_filename)` / `into_app_for_stdin(show_filename)`
— the heart of the wiring, shown below.
### 13.2 The wiring function
```rust
// Simplified version of cli.rs:257
pub fn into_app_for_file(&self, path: &Path, show_filename: bool) -> Result<App<Seek>> {
    let source = FileSource::open(path)?;
    let sink = self.make_sink()?;
    let color = self.resolve_color(sink.is_terminal());
    let opts = FormatOpts { /* show_line_numbers, filename, color, ... */ };
    let stage2 = Stage1::with_seekable_source(Box::new(source));
    // Matcher: a regex takes precedence, then an explicit selector, else everything.
    let stage3 = if let Some(pat) = &self.regex {
        // `invert` stands for the value of the -v flag.
        stage2.with_matcher(Box::new(RegexMatcher::new(pat, invert)?))
    } else if let Some(raw) = self.get_selector() {
        let sel = Selector::parse(&raw)?;
        match sel {
            Selector::All => stage2.with_matcher(Box::new(AllMatcher)),
            Selector::LineNumbers => stage2.with_matcher(Box::new(LineMatcher::from_selector(&sel))),
            Selector::Positions => stage2.with_position_matcher(PositionMatcher::from_selector(&sel)),
        }
    } else {
        stage2.with_matcher(Box::new(AllMatcher))
    };
    // Expander: line context only when -c is positive.
    let stage4 = match self.context {
        Some(n) if n > 0 => stage3.with_expander(Box::new(LineContext::new(n))),
        _ => stage3.with_expander(Box::new(NoContext)),
    };
    // Formatter: fragments when -n is given, whole lines otherwise.
    let stage5 = if let Some(n) = self.char_context {
        stage4.with_formatter(Box::new(FragmentFormatter::new(opts, n)))
    } else {
        stage4.with_formatter(Box::new(PlainFormatter::new(opts)))
    };
    Ok(stage5.with_sink(sink))
}
```
The stdin variant is the same with `NonSeek` and an early `Err(PositionalWithStdin)`
if the user typed `line:col` against stdin.
### 13.3 Multi-file handling
`main.rs` iterates over `get_files()`. Each iteration builds a **fresh**
`App` and runs it through the pipeline:
```rust
// src/main.rs:19
fn run(cli: Cli) -> sel::Result<()> {
    let files = cli.get_files();
    let show_filename = cli.with_filename || files.len() > 1;
    for path in &files {
        if path.as_os_str() == "-" {
            sel::pipeline::run(cli.into_app_for_stdin(show_filename)?)?;
        } else {
            sel::pipeline::run(cli.into_app_for_file(path, show_filename)?)?;
        }
    }
    Ok(())
}
```
File prefix auto-activates when `files.len() > 1`, matching `grep`'s
behaviour. `-H` forces it on.
---
## 14. Library API & Embedding
`sel` is a regular crate on crates.io; `cargo add sel` makes everything
in `lib.rs` available.
### 14.1 The minimum embedding
```rust
use sel::{Stage1, NoContext, PlainFormatter, StdoutSink,
          AllMatcher, FormatOpts};
use sel::source::FileSource;

fn cat_all(path: &std::path::Path) -> sel::Result<()> {
    let source = FileSource::open(path)?;
    let opts = FormatOpts {
        show_line_numbers: true,
        show_filename: false,
        filename: None,
        color: false,
        target_marker: false,
    };
    // Build the five stages in order: source, matcher, expander, formatter, sink.
    let app = Stage1::with_seekable_source(Box::new(source))
        .with_matcher(Box::new(AllMatcher))
        .with_expander(Box::new(NoContext))
        .with_formatter(Box::new(PlainFormatter::new(opts)))
        .with_sink(Box::new(StdoutSink::new()));
    sel::pipeline::run(app)
}
```
Every flag the CLI offers is achievable the same way: substitute a
different matcher, expander, formatter, or sink.
### 14.2 Parsing selectors from other input
```rust
use sel::Selector;
let sel = Selector::parse("1,5,10-20").unwrap().normalize();
```
`Selector::parse` is pure (no I/O); it's safe to call on user-supplied
strings, and its errors are `SelError::InvalidSelector(reason)`.
### 14.3 Custom matchers
Implement `sel::Matcher` — for example, a "match even line numbers"
matcher:
```rust
use sel::{Matcher, MatchInfo};
struct EvenOnly;
impl Matcher for EvenOnly {
    fn match_line(&mut self, line: &sel::Line) -> MatchInfo {
        MatchInfo { hit: line.no % 2 == 0, ..MatchInfo::default() }
    }
}
```
Plug it into stage 2 just like the built-ins.
### 14.4 Custom sinks
Anything that is `Write + Sink` works. The sink trait requires two extra
methods beyond `Write` (`is_terminal`, `finish`). A common embedding
pattern is to write to an in-memory buffer for snapshot tests.
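
An in-memory sink for snapshot tests could be sketched as below. To keep the example self-contained it declares a local copy of the `Sink` trait with the shape quoted in §11.4; `MemorySink` and `contents` are hypothetical names. Because `finish` consumes the boxed sink, the buffer sits behind an `Arc<Mutex<…>>` so a cloned handle can still read it afterwards.

```rust
use std::io::{self, Write};
use std::sync::{Arc, Mutex};

// Local copy of the trait shape quoted in §11.4, for self-containment.
pub trait Sink: Write {
    fn is_terminal(&self) -> bool;
    fn finish(self: Box<Self>) -> io::Result<()>;
}

/// Hypothetical in-memory sink; clones share the same buffer.
#[derive(Clone, Default)]
pub struct MemorySink {
    buf: Arc<Mutex<Vec<u8>>>,
}

impl MemorySink {
    /// Snapshot of everything written so far.
    pub fn contents(&self) -> Vec<u8> {
        self.buf.lock().unwrap().clone()
    }
}

impl Write for MemorySink {
    fn write(&mut self, data: &[u8]) -> io::Result<usize> {
        self.buf.lock().unwrap().extend_from_slice(data);
        Ok(data.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

impl Sink for MemorySink {
    fn is_terminal(&self) -> bool {
        false // never colorize captured output
    }

    fn finish(self: Box<Self>) -> io::Result<()> {
        Ok(()) // nothing to flush for an in-memory buffer
    }
}
```

A test would keep one clone as a probe, hand the other to the pipeline as `Box<dyn Sink>`, and compare `probe.contents()` against the expected bytes after the run.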
---
## 15. Extension Points
If you are adding a feature to `sel`, this is the shortest path:
| Change | Shortest path |
|---|---|
| New CLI flag | Add field in [`cli.rs:24`](../src/cli.rs), extend `validate()` if needed, wire into `into_app_for_*`. |
| New matching mode | Implement `Matcher`. Decide if it needs seek → pick stage2 method. Wire into `cli.rs`. |
| New context strategy | Implement `Expander` (`push` + `drain`). |
| New output shape | Implement `Formatter` or extend `FormatOpts`. Consider adding a new variant instead of bit-flags. |
| New output destination | Implement `Sink` (which requires `Write`). |
| New error class | Add a variant to `SelError`. Compile errors guide you to every match site. |
For each change, add an integration test under `tests/` and update
`docs/USAGE.md` if it is user-facing.
---
## 16. Performance Characteristics
### 16.1 Complexity
- **Per-line work** is cheap for every matcher in realistic configurations; only the regex matcher scales with line length:
- `AllMatcher`: O(1).
- `LineMatcher`: O(r) where r = number of merged ranges (small in practice).
- `PositionMatcher`: amortized O(1) via two-pointer cursor.
- `RegexMatcher`: O(|line|) via the linear-time Rust regex.
- **Memory** is O(n + L) where n is `-c` and L is the longest line. No
whole-file buffering.
- **I/O** is one forward pass through `BufReader` (default 8 KiB buffer for
sources; 64 KiB for the sink).
### 16.2 Release profile
```toml
# Cargo.toml
[profile.release]
opt-level = 3
lto = true
codegen-units = 1
strip = true
panic = "abort"
```
LTO plus single codegen unit plus panic=abort give tight, small binaries
(important for the `cargo-dist`-published artifacts on the Releases page).
### 16.3 Benchmarks
Criterion benches live in [`benches/large_file.rs`](../benches/large_file.rs):
- `selector_parsing` — parse cost for selectors from "single" to
"large_mixed".
- `selector_normalize` — cost of `Selector::normalize()` on overlapping /
adjacent / singleton cases.
- `line_matching` — pre-normalized vs. non-normalized range lookup per
line (quantifies the normalization win).
- `large_file` — parsing multi-thousand-line selectors.
Run with `cargo bench`. Reports land in `target/criterion/`.
### 16.4 Where bottlenecks live
In practice the hot path is:
1. `BufReader::read_until` — kernel I/O, fairly irreducible.
2. `Regex::is_match` (when `-e`) — dominates for regex workloads.
3. `String::from_utf8_lossy` in the formatter — only when actually writing.
There is currently no parallelism; the pipeline is single-threaded by
design (one file = one pass). Multi-file runs are processed sequentially.
---
## 17. Testing Strategy
### 17.1 Test layers
| Layer | Location | Covers |
|---|---|---|
| **Unit** | `#[cfg(test)] mod tests` next to code | selector parsing, expander edge cases, ANSI painting |
| **Integration** | `tests/*.rs` | end-to-end CLI behaviour |
| **Benchmarks** | `benches/large_file.rs` | Criterion micro-benchmarks |
| **Property-based** | `proptest` dev-dependency | (available; used sparingly) |
### 17.2 Integration test files
| File | Covers |
|---|---|
| [`tests/basic.rs`](../tests/basic.rs) | Smoke tests, single line output |
| [`tests/selectors.rs`](../tests/selectors.rs) | Line numbers, ranges, lists |
| [`tests/positions.rs`](../tests/positions.rs) | `L:C` with/without `-n` |
| [`tests/regex.rs`](../tests/regex.rs) | `-e` patterns, highlighting |
| [`tests/invert.rs`](../tests/invert.rs) | `-v` negative matches |
| [`tests/context.rs`](../tests/context.rs) | `-c N` windows & merging |
| [`tests/stdin.rs`](../tests/stdin.rs) | `-` and bare-stdin, rejections |
| [`tests/multi_file.rs`](../tests/multi_file.rs) | Filename prefixes, multiple paths |
| [`tests/output_file.rs`](../tests/output_file.rs) | `-o`, `--force`, `-o -` |
### 17.3 Running tests
```bash
cargo test # everything, debug
cargo test --release # optional, catches release-only issues
cargo test --test regex # one integration file
cargo test selector::tests # module subset
cargo test -- --nocapture # stdout from tests
```
Integration tests run in parallel; they must use `tempfile::TempDir` for
any filesystem interaction.
### 17.4 Coverage philosophy
Error paths are covered in addition to happy paths because the error
messages are part of the user experience. For example,
`positional_with_stdin_has_clear_message` in [`src/error.rs`](../src/error.rs)
checks that the user-visible message contains the word "stdin".
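A test of this kind can be sketched as follows. The enum below is a hypothetical stand-in for `SelError` (its variant names are assumptions); only the message texts are taken from the error table in section 19.1.

```rust
use std::fmt;

/// Hypothetical stand-in for the crate's error type.
#[derive(Debug)]
enum SketchError {
    PositionalWithStdin,
    InvertWithoutRegex,
}

impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::PositionalWithStdin => f.write_str(
                "positional selectors require a seekable file; stdin is line-only",
            ),
            Self::InvertWithoutRegex => f.write_str("--invert-match requires --regex"),
        }
    }
}

impl std::error::Error for SketchError {}

/// Mirrors the idea of `positional_with_stdin_has_clear_message`:
/// assert on a keyword in the rendered message, not on the variant alone.
fn positional_with_stdin_has_clear_message() -> bool {
    SketchError::PositionalWithStdin.to_string().contains("stdin")
}

fn main() {
    assert!(positional_with_stdin_has_clear_message());
    println!("Error: {}", SketchError::InvertWithoutRegex);
}
```

Asserting on a keyword rather than the full string keeps the test meaningful while leaving room to reword the message.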
---
## 18. Build, Release & CI
### 18.1 Build matrix
GitHub Actions runs a single matrix job on every push (see
`.github/workflows/ci.yml`) across Linux, macOS, and Windows. The job:
1. `cargo fmt --check`
2. `cargo clippy --all-targets -- -D warnings`
3. `cargo test`
### 18.2 Release pipeline
Tagging `vX.Y.Z` triggers `release.yml`, which is driven by
[`cargo-dist`](https://opensource.axo.dev/cargo-dist/). It:
1. Builds cross-platform binaries for the targets declared in
`Cargo.toml` `[workspace.metadata.dist]`:
- `aarch64-apple-darwin`, `x86_64-apple-darwin`
- `aarch64-unknown-linux-gnu`, `x86_64-unknown-linux-gnu`
- `x86_64-pc-windows-msvc`
2. Publishes a GitHub Release with shell and PowerShell installers.
3. Runs `cargo publish` to push the new version to crates.io.
The release workflow is manually tweaked to append the `cargo publish`
step; `allow-dirty = ["ci"]` in `Cargo.toml` lets cargo-dist coexist with
that manual section.
### 18.3 Maintainer release steps
1. Bump `version` in `Cargo.toml`.
2. Move `## [Unreleased]` → `## [X.Y.Z] — DATE` in `CHANGELOG.md`.
3. Commit with `chore: bump to X.Y.Z`.
4. Tag `vX.Y.Z`, push with `--tags`.
5. CI takes over.
---
## 19. Troubleshooting & Pitfalls
### 19.1 Common user-facing errors
| `Error: invalid selector: '...'` | Non-digit/comma/hyphen in selector | Check selector syntax; use `-e` for patterns |
| `Error: positional selectors require a seekable file; stdin is line-only` | `sel 23:260 -` or `... \| sel 23:260` | Pass a file path, or drop the column |
| `Error: --invert-match requires --regex` | `-v` without `-e` | Add `-e PATTERN` |
| `Error: --char-context requires --regex or a positional selector` | `-n N` with bare line selector | Use `-c N` for line context, or add `-e`/position |
| `Error: output file already exists: ... (use --force to overwrite)` | `-o EXISTING` | Add `--force` or pick another path |
### 19.2 Performance pitfalls
- Very large `-c` values force a larger ring buffer; memory grows as
O(N·L), where N is the context size and L is the typical line length.
- Catastrophic backtracking cannot occur with the Rust `regex` crate,
which guarantees linear-time matching, but extremely permissive regexes
still dominate the per-line cost.
- `sel 1-10000000 huge.txt` has to count lines up to the upper bound; the
matcher itself is fast, but the I/O pass still reads and compares
every line.
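The memory cost of a large `-c` value can be seen in a small sketch of a "lines before the hit" buffer. This assumes a `VecDeque`-backed ring buffer holding owned lines; `BeforeContext` and its methods are hypothetical names, not the crate's `LineContext`.

```rust
use std::collections::VecDeque;

/// Keeps at most `cap` most recent lines so they can be emitted as
/// leading context when a hit arrives.
struct BeforeContext {
    cap: usize,
    lines: VecDeque<Vec<u8>>,
}

impl BeforeContext {
    fn new(cap: usize) -> Self {
        Self { cap, lines: VecDeque::new() }
    }

    /// Remember a non-hit line, evicting the oldest one when full.
    fn push(&mut self, line: &[u8]) {
        if self.cap == 0 {
            return;
        }
        if self.lines.len() == self.cap {
            self.lines.pop_front();
        }
        self.lines.push_back(line.to_vec());
    }

    /// Hand over the buffered lines (oldest first) and empty the buffer.
    fn drain(&mut self) -> Vec<Vec<u8>> {
        self.lines.drain(..).collect()
    }

    /// Bytes currently held: at most `cap` lines times their lengths.
    fn held_bytes(&self) -> usize {
        self.lines.iter().map(Vec::len).sum()
    }
}

fn main() {
    let mut ctx = BeforeContext::new(2);
    for line in ["a", "bb", "ccc", "dddd"] {
        ctx.push(line.as_bytes());
    }
    println!("holding {} bytes", ctx.held_bytes());
    println!("context: {:?}", ctx.drain());
}
```

With context size N and typical line length L, the buffer in this sketch holds roughly N·L bytes at steady state, independent of file size.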
### 19.3 UTF-8 pitfalls
- **Column numbers are byte positions.** A line with multi-byte UTF-8
before the target column will not align with character counts from a
text editor. Convert ahead of time if needed.
- **Terminal rendering** depends on the user's terminal; `sel` emits raw
ANSI and line contents. Control bytes in input will be passed through
verbatim.
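Converting an editor's character column to a byte column ahead of time can be sketched as below. Two assumptions are baked in and not confirmed by this document: columns are 1-based, and "character" means a Unicode scalar value (editors that count grapheme clusters or UTF-16 units may disagree). `char_col_to_byte_col` is a hypothetical helper, not part of `sel`.

```rust
/// Convert a 1-based character column into a 1-based byte column.
/// Returns `None` for column 0 or a column past the end of the line.
fn char_col_to_byte_col(line: &str, char_col: usize) -> Option<usize> {
    let idx = char_col.checked_sub(1)?;
    line.char_indices()
        .nth(idx)
        .map(|(byte_idx, _)| byte_idx + 1)
}

fn main() {
    // U+00EF takes two bytes in UTF-8, so everything after it shifts by one.
    let line = "na\u{00EF}ve";
    for col in 1..=6 {
        println!("char col {col} -> byte col {:?}", char_col_to_byte_col(line, col));
    }
}
```

In this example the `v` is the fourth character but starts at the fifth byte, which is the kind of offset that makes an editor's column disagree with a byte-based one.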
### 19.4 Shell pitfalls
- `!` in a regex may be history-expanded by bash/zsh; quote the pattern:
`sel -e '!important' notes.txt`.
- Patterns that begin with a hyphen (`-foo`) can be mistaken for flags
by `clap`; separate with `--`: `sel -e -- '-foo' logs/*.log`.
---
## 20. Glossary
**Emit.** A line produced by the expander, carrying metadata (role, spans).
Formatters convert emits to bytes.
**Expander.** Stage between matcher and formatter. Decides what is emitted
and when: just hits, or hits plus context.
**Formatter.** Converts an `Emit` into bytes to write to a `Sink`.
**Hit.** A line that matched the matcher's criterion. `MatchInfo.hit`.
**Line spec.** A single line or a range (`LineSpec::Single | Range`).
**Matcher.** Per-line classifier. Stateful: may remember where it is in a
sorted list of positions.
**Position.** A (line, column) pair.
**Role.** Whether an emit is a target or a context neighbour.
**Seekable.** A source that can be paired with a positional selector.
Only `FileSource` is seekable.
**Selector.** Parsed form of the user's selector argument:
`All | LineNumbers(Vec<LineSpec>) | Positions(Vec<Position>)`.
**Sink.** Output destination: `StdoutSink` or `FileSink`.
**Source.** Input origin: `FileSource` or `StdinSource`. Streams lines.
**Typestate.** The pattern used in `app.rs` where `App<K>` and the stages
use phantom types to encode valid transitions.
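As an illustration of the typestate entry, here is a minimal sketch of how a phantom type parameter can make a positional selector available only for seekable input. The type names `App`, `Seek`, and `NonSeek` come from the file index; every field and method below is hypothetical and chosen for the example, not taken from `src/app.rs`.

```rust
use std::marker::PhantomData;

/// Marker types: they carry no data and exist only at compile time.
struct Seek;
struct NonSeek;

struct App<K> {
    source: String,
    position: Option<(usize, usize)>,
    _kind: PhantomData<K>,
}

impl App<NonSeek> {
    /// Stdin is line-only, so this state offers no positional method.
    fn from_stdin() -> Self {
        App { source: "<stdin>".to_string(), position: None, _kind: PhantomData }
    }
}

impl App<Seek> {
    fn from_file(path: &str) -> Self {
        App { source: path.to_string(), position: None, _kind: PhantomData }
    }

    /// Only defined for `App<Seek>`: a (line, column) target.
    fn with_position(mut self, line: usize, col: usize) -> Self {
        self.position = Some((line, col));
        self
    }
}

impl<K> App<K> {
    /// Available in every state.
    fn describe(&self) -> String {
        match self.position {
            Some((line, col)) => format!("{} @ {}:{}", self.source, line, col),
            None => self.source.clone(),
        }
    }
}

fn main() {
    let file_app = App::<Seek>::from_file("log.txt").with_position(23, 260);
    println!("{}", file_app.describe());

    let stdin_app = App::<NonSeek>::from_stdin();
    println!("{}", stdin_app.describe());
    // stdin_app.with_position(23, 260); // would not compile: no such method
}
```

The invalid combination is rejected by the compiler in this sketch; a CLI still needs a runtime check, because the choice between file and stdin is only known once arguments are parsed.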
---
## Appendix A — File Index
Each path is relative to the repo root.
```
Cargo.toml manifest + dist config
README.md quickstart for users
CHANGELOG.md version history
CONTRIBUTING.md dev loop, test rules, release steps
CODE_OF_CONDUCT.md community standards
LICENSE-MIT MIT
LICENSE-APACHE Apache-2.0
PLAN.md design notes (historical)
src/main.rs binary entry point (30 lines)
src/lib.rs public re-exports
src/cli.rs clap struct, args → App wiring, ColorMode
src/app.rs Stage1..5, App<K>, Seek / NonSeek
src/pipeline.rs the single driver loop
src/selector.rs Selector, LineSpec, Position, parse/normalize
src/context.rs Expander, NoContext, LineContext, EmitOwned
src/error.rs SelError + Result
src/types.rs Line, MatchInfo, Role, Emit
src/matcher/mod.rs Matcher trait + AllMatcher
src/matcher/lines.rs LineMatcher
src/matcher/position.rs PositionMatcher (seek-only)
src/matcher/regex.rs RegexMatcher (bytes regex)
src/source/mod.rs Source trait
src/source/file.rs FileSource
src/source/stdin.rs StdinSource
src/format/mod.rs Formatter, FormatOpts
src/format/plain.rs PlainFormatter
src/format/fragment.rs FragmentFormatter (char-context + caret)
src/format/ansi.rs three ANSI codes + paint()
src/sink/mod.rs Sink trait
src/sink/stdout.rs StdoutSink (64 KiB BufWriter)
src/sink/file.rs FileSink (create-new / force)
tests/ per-feature integration tests
benches/large_file.rs Criterion micro-benchmarks
docs/README.md user docs index
docs/USAGE.md short CLI reference
docs/ARCHITECTURE.md short architectural summary
docs/REFERENCE.md (this file) the long-form reference
```
---
## Appendix B — Reading Paths
Suggested traversals for different audiences.
### For a user who just installed `sel`
1. [`README.md`](../README.md) — quick examples.
2. [`USAGE.md`](USAGE.md) — every flag explained once.
3. Sections [7](#7-selector-grammar) and [19](#19-troubleshooting--pitfalls)
of this file.
### For a new contributor
1. [`CONTRIBUTING.md`](../CONTRIBUTING.md) — dev loop & conventions.
2. Sections [2](#2-system-overview), [3](#3-design-decisions), and
[4](#4-the-pipeline-in-detail) of this file.
3. [`src/pipeline.rs`](../src/pipeline.rs), [`src/cli.rs`](../src/cli.rs),
[`src/app.rs`](../src/app.rs) — three files, three concerns.
4. [`tests/basic.rs`](../tests/basic.rs) to see how end-to-end tests
look; then pick a feature area matching your change and read its test
file.
### For someone embedding `sel` as a library
1. Section [14](#14-library-api--embedding) of this file.
2. [`src/lib.rs`](../src/lib.rs) — the full re-export surface.
3. rustdoc on [docs.rs/sel-rs](https://docs.rs/sel-rs).
4. Section [15](#15-extension-points) when adding a custom stage.
### For an architect reviewing the design
1. Sections [3](#3-design-decisions) and [4](#4-the-pipeline-in-detail).
2. Sections [8](#8-matchers)–[11](#11-sources--sinks) for stage-level
detail.
3. Section [16](#16-performance-characteristics) for complexity & bench
methodology.
---
*Last regenerated: see the git history of this file.*