sift — Structural codebase index for LLM tooling

sift builds a language-agnostic structural index of a codebase using tree-sitter and can optionally enrich it with semantic embeddings (local via candle, or API-based) for natural-language code search. It is meant to be used as an LLM CLI skill: the LLM calls sift query to find definitions, trace calls, and explore code relationships. Structural queries need no embeddings or API calls; semantic queries use the embeddings.

Status

Structural index is stable. Semantic embedding layer is new.

Rust crate compiles and passes clippy (cognitive complexity ≤ 15)
Indexes Rust, Python, JavaScript, TypeScript, TSX, Go, C, C++, Java, Ruby, Zig, Bash via tree-sitter
Captures: function/struct/trait/enum/class/type/interface definitions, call sites (method calls, qualified calls), and import/include statements
Cross-file import resolution — imports link to the defining symbol's file/line/kind
Query commands: define, calls, callees, implements, imports, importers, file, files, symbols matching, semantic
sift skill outputs a ready-to-use LLM tool definition (OpenAI-compatible)
Semantic embeddings: API-based (OpenAI-compatible) always available; local inference via candle with the candle feature
Agentic benchmark harness in bench-fixtures/ — 25/25 structural tasks pass (17x avg token savings), 20/20 embedding tasks pass (requires API embedder)
Incremental re-index: file mtimes tracked on each index, subsequent runs only re-parse changed files. Re-index is O(changed) not O(total)
Auto-re-index on stale sift query: transparently rebuilds the index when source files change
sift watch daemon: uses the notify crate (v7) to monitor the filesystem and re-index automatically when files change (non-blocking thread, 500ms debounce)
Atomic index save: .tmp + rename prevents partial-read races; V2 magic prefix for backward compat
Unit tests covering parser, index, query, event filtering, and mtime collection
No functions excluded from the complexity threshold
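
The atomic index save listed above (write to a .tmp file, then rename) is a general pattern that can be sketched in a few lines. This is a Python illustration under stated assumptions, not sift's Rust implementation: the magic bytes and both function names are hypothetical.

```python
import os
import tempfile

# Hypothetical magic prefix; the bytes sift actually writes may differ.
MAGIC_V2 = b"SIFTV2\n"


def save_index_atomically(path: str, payload: bytes) -> None:
    """Write the payload to `path + ".tmp"`, then rename it over `path`."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(MAGIC_V2 + payload)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to disk before the rename
    # A reader sees either the old file or the new one, never a partial write.
    os.replace(tmp_path, path)


def load_index(path: str) -> tuple[int, bytes]:
    """Return (format_version, payload); files without the prefix count as version 1."""
    with open(path, "rb") as f:
        data = f.read()
    if data.startswith(MAGIC_V2):
        return 2, data[len(MAGIC_V2):]
    return 1, data


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "index.bin")
        save_index_atomically(target, b"example payload")
        print(load_index(target))
```

Checking for a prefix on load is one way an older, unprefixed file can still be recognized and handled separately.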

Install

cargo install code-sift                        # from crates.io
cargo install --features candle code-sift      # with local embeddings

Or build from source:

git clone https://github.com/rsarv3006/sift
cd sift
cargo build --release
./target/release/sift --help

Philosophy

Zero-token structural queries: Most code understanding tasks (find definition, trace callers, list symbols in a file) are purely structural and need zero LLM tokens when served by sift.
LLM skill first: sift is designed to be invoked by an LLM as a tool. sift skill outputs the tool definition for plugging into an LLM system prompt.
Local by default, language-agnostic: tree-sitter parsers for Rust, Python, JavaScript, TypeScript, TSX, Go, C, C++, Java, Ruby, Zig, and Bash. No network required.
Optional semantic search: compute embeddings during indexing (--embed, configured via SIFT_EMBED_* env vars or a config file) and query with sift query --embed "semantic ...".

Usage

# Index a codebase
sift index /path/to/project
sift index /path/to/project --embed           # + semantic embeddings

# Queries (returns JSON)
sift query "define parse_file"                   # Find a definition
sift query "calls parse_file"                    # Who calls it
sift query "callees parse_file"                  # What it calls
sift query "implements Iterator"                 # Implementations
sift query "symbols matching revenue"            # Substring name search
sift query "file main.rs"                        # Symbols in a file
sift query "files"                               # All indexed files
sift query "parse_file"                          # Bare name -> define
sift query --embed "semantic calculate revenue"  # Semantic search

# Watch for changes and auto-re-index
sift watch                              # watches current directory
sift watch /path/to/project --embed     # + semantic embeddings on re-index

# LLM tool definition
sift skill
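
For wiring the commands above into an LLM tool loop, a host program might dispatch tool calls roughly as follows. This is a sketch, not part of sift: the "query" argument name is an assumption (check the actual schema printed by sift skill), and the helper names are hypothetical. Only the sift query and --embed invocations come from the usage above.

```python
import json
import subprocess
from typing import Callable, Sequence


def build_sift_command(query: str, use_embeddings: bool = False) -> list[str]:
    """Build the argv for `sift query`, adding --embed for semantic queries."""
    argv = ["sift", "query"]
    if use_embeddings:
        argv.append("--embed")
    argv.append(query)
    return argv


def run_command(argv: Sequence[str]) -> str:
    """Run a command and return its stdout (raises on a non-zero exit code)."""
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout


def handle_tool_call(
    arguments_json: str,
    runner: Callable[[Sequence[str]], str] = run_command,
) -> str:
    """Dispatch one tool call; the "query" argument name is an assumption."""
    arguments = json.loads(arguments_json)
    query = arguments["query"]
    # Semantic queries need the embedder, so pass --embed for them.
    use_embeddings = query.startswith("semantic ")
    return runner(build_sift_command(query, use_embeddings))
```

Passing the runner as a parameter keeps the dispatch logic testable without sift installed; in production the default runner shells out to the real binary.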

Output example

{"type":"definition","name":"parse_file","kind":"function",
 "file":"src/parser.rs","line":154,"end_line":239}

{"type":"definition","name":"Calculator","kind":"struct",
 "file":"src/lib.rs","line":7,"end_line":9,
 "doc":"/// A calculator that chains operations and evaluates them sequentially."}
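
A consumer can turn records like these into Python dictionaries with the standard json module. The sketch below assumes stdout holds one or more JSON values (separate objects or a single array), since the exact framing is not specified here; parse_sift_output and format_location are hypothetical helper names.

```python
import json


def parse_sift_output(text: str) -> list[dict]:
    """Parse one or more JSON values from stdout into a flat list of records."""
    decoder = json.JSONDecoder()
    records: list[dict] = []
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        if text[pos].isspace():
            pos += 1
            continue
        value, pos = decoder.raw_decode(text, pos)
        # Accept either bare objects or a JSON array of objects.
        records.extend(value if isinstance(value, list) else [value])
    return records


def format_location(record: dict) -> str:
    """Render a record as file:start-end for display or follow-up reads."""
    return f'{record["file"]}:{record["line"]}-{record["end_line"]}'
```

With the first record above, format_location yields src/parser.rs:154-239, which is enough for an LLM to request only that line range instead of the whole file.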

Semantic results include a score field (cosine similarity). Results with doc comments include a doc field:

{"type":"semantic","name":"calculate_revenue","kind":"function",
 "file":"src/finance.rs","line":42,"end_line":56,"score":0.87,
 "doc":"/// Calculate monthly recurring revenue from the subscriptions list."}
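
For readers unfamiliar with the score field, cosine similarity between a query embedding and each symbol embedding can be computed as below. This is a plain-Python illustration of the metric and of top-k ranking, not sift's implementation; rank_symbols and its inputs are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for zero vectors; avoids division by zero
    return dot / (norm_a * norm_b)


def rank_symbols(
    query_vec: list[float],
    symbol_vecs: dict[str, list[float]],
    top_k: int = 5,
) -> list[tuple[str, float]]:
    """Score every symbol against the query and return the best top_k."""
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in symbol_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

Scores range from -1 to 1, with values near 1 meaning the symbol's embedded text points in nearly the same direction as the query.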

Embedding Configuration

Semantic search is optional. When you pass --embed, sift reads configuration from these sources in order, and later sources override earlier ones:

Hardcoded defaults
~/.config/sift/config.toml (user-level)
.sift/config.toml (project-level, relative to cwd)
SIFT_EMBED_* environment variables
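
The "later wins" layering can be pictured as a sequence of dictionary updates. The sketch below is illustrative only: it takes already-parsed config sections as plain dicts, the api_key and model_path key names are assumptions, and treating OPENAI_API_KEY as a last-resort fallback is one reading of the documented behavior.

```python
from typing import Mapping

# Defaults as documented for the SIFT_EMBED_* variables.
DEFAULTS = {
    "backend": "auto",
    "api_url": "https://api.openai.com/v1/embeddings",
    "api_model": "text-embedding-3-small",
}

# Environment variable -> config key. "api_key" and "model_path" are assumed names.
ENV_KEYS = {
    "SIFT_EMBED_BACKEND": "backend",
    "SIFT_EMBED_API_KEY": "api_key",
    "SIFT_EMBED_API_URL": "api_url",
    "SIFT_EMBED_API_MODEL": "api_model",
    "SIFT_EMBED_MODEL_PATH": "model_path",
}


def resolve_embed_config(
    user_cfg: Mapping[str, str],
    project_cfg: Mapping[str, str],
    env: Mapping[str, str],
) -> dict[str, str]:
    """Merge the four sources so that each later source overrides earlier ones."""
    merged = dict(DEFAULTS)
    merged.update(user_cfg)     # [embed] section of ~/.config/sift/config.toml
    merged.update(project_cfg)  # [embed] section of .sift/config.toml
    for var, key in ENV_KEYS.items():
        if var in env:
            merged[key] = env[var]
    # Fallback: use OPENAI_API_KEY only when no other source supplied a key.
    if "api_key" not in merged and "OPENAI_API_KEY" in env:
        merged["api_key"] = env["OPENAI_API_KEY"]
    return merged
```

For example, a project config that sets backend = "api" overrides the auto default, while SIFT_EMBED_API_MODEL in the environment overrides an api_model set in either config file.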

Example project config (.sift/config.toml):

[embed]
backend = "api"
api_url = "http://10.0.0.39:11434/v1/embeddings"
api_model = "nomic-embed-text"

Once set, commands work without env vars:

sift index --embed .                         # reads config
sift query --embed "semantic handle http request"

Env vars override config files and are useful for per-invocation overrides:

| Variable | Default | Description |
|---|---|---|
| `SIFT_EMBED_BACKEND` | `auto` | `api`, `local` (requires `candle` feature), or `auto` |
| `SIFT_EMBED_API_KEY` | (none) | API key (not needed for Ollama/local endpoints) |
| `SIFT_EMBED_API_URL` | `https://api.openai.com/v1/embeddings` | API endpoint |
| `SIFT_EMBED_API_MODEL` | `text-embedding-3-small` | Model name for API backend |
| `SIFT_EMBED_MODEL_PATH` | (none) | Path to local model files (`candle` feature only) |
| `OPENAI_API_KEY` | (none) | Fallback if `SIFT_EMBED_API_KEY` is unset |

If no embedding backend is available, sift prints a warning at index time telling you what to set.

Build with candle for fully local embeddings:

cargo build --release --features candle
sift index --embed .                         # uses candle automatically

Checking

make check       # lint + test + complexity
make lint        # cargo clippy (cognitive complexity threshold: 15)
make test        # unit tests (parser, index, query)
make complexity  # arborist-cli cyclomatic/cognitive complexity
make bench              # synthetic codebase benchmark (25 correctness tasks)
make bench-embed        # embedding benchmark (20 semantic tasks, requires API embedder)
make bench-incremental  # incremental re-index benchmark (time savings vs full)
make bench-real         # real-repo benchmark (requires cloned repo in /tmp/just)

The synthetic benchmark (make bench) indexes fixtures in bench-fixtures/ and verifies correctness against known-answer tasks. The real-repo benchmark (make bench-real) measures token savings against an actual open-source project (current results: 404x avg savings over naive grep+cat on the just crate, 123 source files, 577KB). The embedding benchmark (make bench-embed) tests semantic search correctness and token savings using an API embedder (e.g. Ollama with nomic-embed-text). Configure via SIFT_EMBED_BACKEND=api SIFT_EMBED_API_URL=http://localhost:11434/v1/embeddings.

Clippy config is in clippy.toml. Complexity analysis requires arborist-cli:

cargo install arborist-cli

Architecture

sift index  →  tree-sitter parses files  →  extracts symbols, calls, imports
                   └─ --embed  →  candle/API  →  computes symbol embeddings
                                           →  serializes to bincode index (.sift/index.bin)
sift query  →  loads index  →  structural or semantic queries  →  JSON output
                   └─ --embed  →  candle/API  →  embeds query for semantic search
sift skill  →  prints LLM tool definition
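
To make the data flow concrete, here is a toy in-memory version of the kind of index the diagram describes, with define, calls, and callees lookups. It is a simplified Python sketch for intuition only; the class and field names are hypothetical and say nothing about sift's actual bincode layout.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Symbol:
    name: str
    kind: str
    file: str
    line: int
    end_line: int


@dataclass
class ToyIndex:
    symbols: dict = field(default_factory=dict)  # name -> Symbol
    callees_of: dict = field(default_factory=lambda: defaultdict(set))
    callers_of: dict = field(default_factory=lambda: defaultdict(set))

    def add_symbol(self, symbol: Symbol) -> None:
        self.symbols[symbol.name] = symbol

    def add_call(self, caller: str, callee: str) -> None:
        # Store both directions so either lookup is a single dict access.
        self.callees_of[caller].add(callee)
        self.callers_of[callee].add(caller)

    def define(self, name: str) -> Optional[Symbol]:
        """Where is `name` defined?"""
        return self.symbols.get(name)

    def calls(self, name: str) -> list[str]:
        """Who calls `name`?"""
        return sorted(self.callers_of.get(name, ()))

    def callees(self, name: str) -> list[str]:
        """What does `name` call?"""
        return sorted(self.callees_of.get(name, ()))
```

Because every answer is a dictionary lookup over pre-extracted facts, structural queries of this kind need no model inference at query time.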

Roadmap

Completed

Structural index (tree-sitter: symbols, calls, imports)
CLI commands: index, query, skill
Language support: Rust, Python, JS/TS, Go, C, C++, Java, Ruby, Zig
Import and method call capture
Caller name resolution (span-based)
Unit tests across parser, index, query, event filtering, and mtime collection
Cyclomatic/cognitive complexity checking (clippy + arborist, threshold 15)
Semantic embedding layer — candle (local) + API fallback, computed during sift index --embed, queried via sift query --embed "semantic ..."
Language support: 12 languages via tree-sitter (Rust, Python, JS/TS/TSX, Go, C, C++, Java, Ruby, Zig, Bash)
Cross-file import resolution — each import links to the defining symbol's file/line/kind
Agentic benchmark harness (make bench) — 25 correctness tasks across 2 synthetic codebases
Embedding benchmark harness (make bench-embed) — 20 semantic search tasks, requires API embedder
API key optional for embedding (works with Ollama, local LLMs, etc.)
Doc comment extraction — captures ///, /** */, #, // doc comments preceding definitions; included in JSON output and embedding text for better semantic search
Binary index format (bincode) — serializes to .sift/index.bin, faster load/save than JSON for large codebases
Incremental re-index — file mtimes tracked, re-parses only changed files
Auto-re-index on stale query — transparent rebuild when source files change
sift watch daemon — filesystem watcher for continuous auto-re-index
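
The mtime-based incremental re-index in this list boils down to comparing two snapshots of file modification times. A minimal Python sketch of that comparison follows; it is an illustration under assumptions (hypothetical function names, an arbitrary extension filter), not sift's Rust code.

```python
import os


def collect_mtimes(
    root: str, extensions: tuple[str, ...] = (".rs", ".py")
) -> dict[str, float]:
    """Walk `root` and record the modification time of each matching file."""
    mtimes: dict[str, float] = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if filename.endswith(extensions):
                path = os.path.join(dirpath, filename)
                mtimes[path] = os.stat(path).st_mtime
    return mtimes


def plan_reindex(
    previous: dict[str, float], current: dict[str, float]
) -> tuple[list[str], list[str]]:
    """Return (files to re-parse, files to drop from the index)."""
    # New files have no previous mtime; edited files have a different one.
    to_parse = sorted(p for p, m in current.items() if previous.get(p) != m)
    # Files that disappeared must have their symbols removed.
    to_drop = sorted(p for p in previous if p not in current)
    return to_parse, to_drop
```

Only the paths in the first list go back through the parser, which is why re-index cost scales with the number of changed files rather than the size of the repository. The directory walk itself still visits every file, but a stat call is far cheaper than a parse.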

Planned

implements by trait name: currently, implements <name> finds impl blocks by type name, not trait name. A second impl pattern is needed to capture the trait being implemented.
sift query streaming — for very large result sets, support pagination or streaming JSON output.
Per-language indexing performance — measure parse rates per language on large real-world repos, identify slow grammars.
Better semantic embeddings for code — currently embeds symbol name + kind + doc text; consider code-specific models (e.g. starcoder2) or including function signature for richer context.
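
To illustrate the last item, the text handed to the embedder could be assembled from name, kind, and doc comment, with the signature as an optional extra. The exact format sift uses is not documented here, so everything in this sketch (function names, line layout, comment-marker stripping) is an assumption meant only to show the idea.

```python
from typing import Optional


def clean_doc(doc: str) -> str:
    """Strip common doc-comment markers (///, //, #, /** */) and join the lines."""
    lines = []
    for raw in doc.splitlines():
        text = raw.strip().lstrip("/*#").strip()
        if text.endswith("*/"):
            text = text[:-2].strip()
        if text:
            lines.append(text)
    return " ".join(lines)


def embedding_text(
    name: str,
    kind: str,
    doc: Optional[str] = None,
    signature: Optional[str] = None,
) -> str:
    """Compose the text to embed for one symbol (hypothetical layout)."""
    parts = [f"{kind} {name}"]
    if signature:
        parts.append(signature)  # the proposed addition for richer context
    if doc:
        parts.append(clean_doc(doc))
    return "\n".join(parts)
```

Including the signature gives the model parameter and return types to match against, which may help queries phrased in terms of inputs and outputs rather than names.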

code-sift 0.4.0