symbols

A CLI tool that extracts symbols from a directory and returns a token-efficient breakdown, designed for feeding codebase context into LLM prompts.

Design

This section describes the target design. See issues.md for current implementation status and remaining work.

Parsing

Uses tree-sitter for language-agnostic symbol extraction. Each supported language has a small query file (.scm) that maps language-specific node types to a common symbol model (functions, types, modules, exports, etc.).
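
As a rough illustration of what a common symbol model might look like, the sketch below defines a small set of symbol kinds and a mapping from query capture names to those kinds. The type names, field names, and capture names are all assumptions made for this example; the design does not specify them.

```rust
/// Kinds of symbol in the common model (illustrative names, not from the design).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SymbolKind {
    Function,
    Type,
    Module,
    Export,
}

/// One extracted symbol, located by its 1-based line in the source file.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Symbol {
    pub kind: SymbolKind,
    pub name: String,
    pub line: usize,
}

/// Hypothetical mapping from a capture name in a .scm query file to a symbol kind.
/// Unknown capture names are ignored by returning None.
pub fn kind_from_capture(capture: &str) -> Option<SymbolKind> {
    match capture {
        "function" => Some(SymbolKind::Function),
        "type" => Some(SymbolKind::Type),
        "module" => Some(SymbolKind::Module),
        "export" => Some(SymbolKind::Export),
        _ => None,
    }
}
```

With a model like this, adding a language would mostly mean writing a query file whose captures use the shared names.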

Token Budgeting

The tool takes a --budget flag that caps the total output size, measured in words.

Granularity function: A single function render(level, path, content) -> output maps each file to its output at a given granularity level. The function is depth-aware (via path) and file-size-aware (via content), so a single level can behave differently for shallow vs deep files or small vs large files. A MAX_LEVEL constant defines the highest available level.

Monotonicity invariant: For any given file, a higher level must never produce fewer words than a lower level. This is tested against fixture files across all levels.
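
A fixture test for this invariant could look roughly like the sketch below. It assumes words are counted by splitting on whitespace and that the render function is passed in as a closure; the helper names word_count and first_violation are hypothetical.

```rust
/// Counts whitespace-separated words (the counting rule is an assumption).
pub fn word_count(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Walks levels 0..=max_level for one file and returns the first level whose
/// output has fewer words than the level below it, or None if the invariant holds.
pub fn first_violation<F>(render: F, max_level: usize, path: &str, content: &str) -> Option<usize>
where
    F: Fn(usize, &str, &str) -> String,
{
    let mut previous = word_count(&render(0, path, content));
    for level in 1..=max_level {
        let current = word_count(&render(level, path, content));
        if current < previous {
            return Some(level);
        }
        previous = current;
    }
    None
}
```

A test suite would call first_violation once per fixture file and assert that the result is None.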

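Budget algorithm: Binary search over 0..=MAX_LEVEL to find the highest level where the total word count across all files fits within the budget. The search is valid because the monotonicity invariant makes the total word count non-decreasing in the level.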
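
The search can be sketched as follows. The function name and the total_words closure (which would render every file at a level and sum the word counts) are assumptions for illustration. The design does not say what happens when even level 0 exceeds the budget, so this sketch returns None in that case.

```rust
/// Finds the highest level in 0..=max_level whose total word count fits the budget.
/// Assumes total_words(level) is non-decreasing in level.
/// Returns None when even level 0 is over budget (behaviour not specified by the design).
pub fn highest_level_within_budget<F>(max_level: usize, budget: usize, total_words: F) -> Option<usize>
where
    F: Fn(usize) -> usize,
{
    if total_words(0) > budget {
        return None;
    }
    let (mut lo, mut hi) = (0, max_level);
    // Loop invariant: level `lo` fits the budget, and the answer lies in lo..=hi.
    while lo < hi {
        // Round the midpoint up so that `lo = mid` always makes progress.
        let mid = lo + (hi - lo + 1) / 2;
        if total_words(mid) <= budget {
            lo = mid;
        } else {
            hi = mid - 1;
        }
    }
    Some(lo)
}
```

Each probe renders the whole tree once, so the number of full renders grows with the logarithm of the level count rather than linearly.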

Starting levels (expected to evolve). Levels differ in which lines are included and how much of each line is shown, but at every level the file content consists of line prefixes with line numbers:

File paths only
Symbol lines, truncated to symbol name (e.g. pub fn new)
Symbol lines, full line-prefix including signature (e.g. pub fn new(lang: Language) -> Self {)
Full source (all lines)

Intermediate levels can be added over time (e.g. multi-line signatures, docstrings).
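
A per-line view of the four starting levels might look like the sketch below. It assumes the levels are numbered 0 to 3 in the order listed, that the caller already knows whether a line is a symbol line (in the real tool that would come from the tree-sitter queries), and that name-only truncation cuts at the first '(' or '{'. None of these details are fixed by the design.

```rust
/// Highest level in this sketch (assumed numbering: 0 = paths only, 3 = full source).
pub const MAX_LEVEL: usize = 3;

/// Returns the prefix of `line` to show at `level`, or None if the line is omitted.
/// The result is always a prefix of `line`, with indentation left untouched.
pub fn line_prefix(level: usize, line: &str, is_symbol_line: bool) -> Option<&str> {
    match level {
        // Level 0: file paths only, so no line content at all.
        0 => None,
        // Level 1: symbol lines, cut before the first '(' or '{' (assumed rule).
        1 if is_symbol_line => {
            let end = line.find(|c: char| c == '(' || c == '{').unwrap_or(line.len());
            Some(line[..end].trim_end())
        }
        // Level 2: symbol lines in full, including the signature.
        2 if is_symbol_line => Some(line),
        // Levels 1 and 2 skip non-symbol lines.
        1 | 2 => None,
        // Level 3 (and anything above): every line of the source.
        _ => Some(line),
    }
}
```

Because each level shows a superset of the previous level's text for every line, a render function built this way satisfies the monotonicity invariant by construction.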

Shallower files are prioritized over deeper ones when budget is tight. Users can zoom into subdirectories by running the tool on them directly with a larger budget.

Output Format

Plain text / markdown, optimized for direct use in LLM prompts.

Line-prefix constraint: Each output line is a prefix of the actual line in the source file, preserving original whitespace and indentation. The tool extracts from the source rather than synthesizing new representations — no cross-language normalization of keywords. This means nesting (e.g. methods inside a Rust impl block) is represented naturally by the source's own indentation.
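
The constraint is easy to check mechanically. The sketch below assumes output has already been parsed into (line number, text) pairs with 1-based line numbers; the function name is hypothetical.

```rust
/// Returns true if every (1-based line number, text) pair is a prefix of the
/// corresponding line in `source`. Out-of-range line numbers fail the check.
pub fn satisfies_line_prefix(source: &str, output: &[(usize, &str)]) -> bool {
    let lines: Vec<&str> = source.lines().collect();
    output.iter().all(|&(number, text)| {
        number
            .checked_sub(1) // line 0 does not exist
            .and_then(|index| lines.get(index))
            .map_or(false, |line| line.starts_with(text))
    })
}
```

Note that an output line with its indentation stripped fails this check, which is the intended behaviour: indentation is what conveys nesting.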

Line numbers are right-aligned and followed by an arrow separator (→). At the name-only level:

src/parser.rs
     1→impl Parser
    12→    pub fn new
    45→    pub fn parse

At the signature level:

src/parser.rs
     1→impl Parser {
    12→    pub fn new(lang: Language) -> Self {
    45→    pub fn parse(&self, source: &str) -> Tree {
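
The numbering in the examples above can be reproduced with a single format call. The field width of 6 is inferred from those examples rather than stated by the design, and format_line is a hypothetical helper name.

```rust
/// Formats one output line: the line number right-aligned in a 6-character
/// field (width inferred from the examples), an arrow, then the line prefix.
pub fn format_line(number: usize, prefix: &str) -> String {
    format!("{:>6}→{}", number, prefix)
}
```

Line numbers with more than six digits would simply widen the field, since the format width is a minimum.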

A --json flag may be added later for machine consumption.

Supported Languages

Per-language support requires a tree-sitter grammar and a query file defining: