syntext
A faster grep for agentic AI. ~20x faster than ripgrep when indexed.
This is largely based on information derived from this Cursor blog post and GitHub Code Search.
A hybrid code search index for agent workflows, built in Rust. Indexes repositories using sparse n-grams with a pre-trained frequency weight table, then narrows to a small candidate set before verification. Designed as a drop-in replacement for rg in AI agent loops where grep is called repeatedly and in parallel.
Status: stable (v1.0). See Project status below.
Installation
macOS (Homebrew)
Homebrew casks are published in whit3rabbit/homebrew-tap.
Linux
Download release artifacts from the GitHub releases page.
VERSION=1.0.0
# Debian/Ubuntu
# Any Linux (x86_64)
&&
From source
Why this exists
AI coding agents call grep dozens of times per task. On large monorepos, each rg invocation touches every file. Those calls compound into significant stalled agent time per coding session.
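syntext builds a content index so queries only touch candidate files, not all files. The verifier confirms matches against actual file content, so every reported match is real; for token-aligned queries the results are identical to ripgrep (see Known limitations for non-aligned substrings).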
Benchmarks
Real-world benchmark runs are tracked in
docs/BENCHMARKS.md. The table below is
the current 2026-03-29 snapshot from preset-backed external runs using the
shared harness scripts/bench_compare.py.
Search latency
The chart uses the mean latency across each preset's exact-count queries. It
compares the two tools that ran on every preset. grep is listed in the table
below, but the Linux preset intentionally skips it.
xychart-beta
title "External preset search latency (ms, lower is better)"
x-axis ["React", "Rust", "TypeScript", "Node", "Linux"]
y-axis "Milliseconds" 0 --> 4000
bar "syntext" [21, 100, 112, 70, 155]
bar "rg" [113, 2183, 3094, 1493, 3681]
| Repo | syntext avg | rg avg | grep avg | Speedup vs rg |
|---|---|---|---|---|
| React | 20.7 ms | 112.9 ms | 314.3 ms | 5.5x |
| Rust compiler | 99.9 ms | 2183.2 ms | 2412.8 ms | 21.9x |
| TypeScript | 111.9 ms | 3093.8 ms | 3171.8 ms | 27.7x |
| Node.js | 69.5 ms | 1492.6 ms | 3186.4 ms | 21.5x |
| Linux kernel | 154.5 ms | 3681.3 ms | n/a | 23.8x |
Method:
- Note: These benchmarks were run against syntext version 1.0.0.
- External repos use the same harness and preset catalog.
- The chart and table use arithmetic means across each preset's exact-count queries.
- Times are single-run preset medians on macOS unless noted otherwise.
- syntext search time excludes index build time. Build time is shown separately.
- Linux uses the cheaper shared large-corpus mode (syntext + rg) because the full three-tool run is too expensive on the benchmark machine.
Notes:
- Latency grows roughly with repo size for scan tools, while syntext stays under 155 ms on every preset in this matrix.
- The exact-count validated preset terms are documented in docs/BENCHMARKS.md.
- This refreshed matrix covers React, Rust, TypeScript, Node.js, and Linux.
- Every query in the 2026-03-29 matrix matched the comparator tool counts.
- Substring-heavy terms such as ReactElement, useEffect, and TyCtxt are intentionally not in the headline README table because they can undercount in syntext relative to rg.
- Historical and exploratory runs, including mismatched-count investigations, remain in docs/BENCHMARKS.md.
Performance Summary
In indexed search scenarios, syntext materially outperforms rg by shrinking
the candidate set before verification.
- Average speedup across the five presets above: 20.1x versus rg.
- Worst case in this matrix is still 5.5x faster than rg on React.
- Largest corpus in this matrix, the Linux kernel, still averaged 154.5 ms.
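The summary figures can be re-derived from the search latency table. The sketch below copies the published per-repo means and speedups, averages the published speedups, and checks that each one agrees with the ratio of the displayed latencies to within 0.1x. Small last-digit differences are expected, presumably because the displayed latencies are themselves rounded. The constant and function names are illustrative only and are not part of syntext.

```rust
/// (repo, syntext mean ms, rg mean ms, published speedup), copied from the table.
const ROWS: [(&str, f64, f64, f64); 5] = [
    ("React", 20.7, 112.9, 5.5),
    ("Rust compiler", 99.9, 2183.2, 21.9),
    ("TypeScript", 111.9, 3093.8, 27.7),
    ("Node.js", 69.5, 1492.6, 21.5),
    ("Linux kernel", 154.5, 3681.3, 23.8),
];

/// Speedup recomputed from the displayed (rounded) latencies.
fn ratio(syntext_ms: f64, rg_ms: f64) -> f64 {
    rg_ms / syntext_ms
}

/// Arithmetic mean of the published per-repo speedups.
fn mean_published_speedup() -> f64 {
    let total: f64 = ROWS.iter().map(|&(_, _, _, speedup)| speedup).sum();
    total / ROWS.len() as f64
}

fn main() {
    for &(repo, st, rg, published) in ROWS.iter() {
        println!("{repo}: published {published:.1}x, recomputed {:.2}x", ratio(st, rg));
    }
    println!("mean of published speedups: {:.1}x", mean_published_speedup());
}
```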
Index build time
| Repo | Preset | Tracked files | Tools | st index |
|---|---|---|---|---|
| React | react_token_aligned | 6,840 | syntext, rg, grep | 746.003 ms |
| Rust compiler | rust_token_aligned | 58,698 | syntext, rg, grep | 3376.174 ms |
| TypeScript | typescript_compiler | 81,362 | syntext, rg, grep | 4807.992 ms |
| Node.js | node_runtime | 47,364 | syntext, rg, grep | 3991.465 ms |
| Linux kernel | linux_token_aligned | 93,018 | syntext, rg | 8357.722 ms |
Usage
CLI
# Build the index
# Search the whole repo
# Restrict search scope with positional paths
# Additional filters and output modes
# Incremental update after edits
# Status
Notes:
- Search is the default command; there is no st search subcommand.
- Like ripgrep, file names are shown by default when searching a directory, the whole repo, or multiple positional paths.
- Like ripgrep, line numbers are off by default when stdout is not a TTY. Use -n to force them on.
Library
use ;
let config = Config ;
let index = open?;
index.build?;
// Search
let results = index.search?;
// Agent workflow: edit files, then search
index.notify_change?;
index.notify_change?;
index.commit_batch?; // atomic visibility
let fresh_results = index.search?;
WASM / Browser / Node.js
The wasm Cargo feature compiles syntext to a fully in-memory index with no filesystem access. Files are provided by the caller as a JS object mapping paths to Uint8Array content.
Build from source
# output: pkg/ (JS glue + .wasm + TypeScript types)
Alternatively, download syntext-wasm-<version>.tar.gz from the releases page.
Usage
import init from "./syntext_bg.js";
await ;
const enc = ;
const idx = ;
const matches = idx.;
// [{path: "src/lib.rs", line_number: 1, line_content: "pub fn build_index() {}",
// submatch_start: 7, submatch_end: 18}]
The WasmIndex constructor indexes everything up front; search() is synchronous and can be called repeatedly. It accepts any pattern that the native st CLI accepts (literal or regex). The -F flag behavior is not exposed at the WASM level.
Weight table
src/tokenizer/weights.rs is a pre-trained [u16; 65536] byte-pair frequency table. Rare pairs get high weights (gram boundaries), common pairs get low weights (gram interiors).
Two generation paths:
| Script | Corpus | When to use |
|---|---|---|
| scripts/weights_gen.py | ~175 MB from bigcode/the-stack-smol (default) | Local regeneration, CI |
| scripts/notebooks/weights_gen_colab.ipynb | 100 GB – 500 GB+ from bigcode/the-stack-dedup | Higher quality, run on Colab Pro |
The current shipped table was trained on ~498 GB across 20+ languages (49.7% pair
coverage, 32,542 / 65,536 non-zero pairs). The Colab notebook uses bulk Parquet
download with checkpointing after every shard (safe against disconnects) and emits
a weights.rs ready to drop into src/tokenizer/. HuggingFace access required
for the-stack-dedup.
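To make the idea of a 65,536-entry byte-pair table concrete, here is a minimal sketch of how such a table could be indexed and filled so that rare pairs receive high weights. Both the packing of a pair as (a << 8) | b and the inverse-log weighting formula are assumptions made for illustration. The shipped weights.rs and the generation scripts may use a different layout and scoring.

```rust
/// Index of the byte pair (a, b) in a flat 65,536-entry table.
/// Assumption: the pair is packed as (a << 8) | b; the real layout may differ.
fn pair_index(a: u8, b: u8) -> usize {
    ((a as usize) << 8) | b as usize
}

/// Count how often each adjacent byte pair occurs in a corpus sample.
fn count_pairs(corpus: &[u8]) -> Vec<u64> {
    let mut counts = vec![0u64; 65536];
    for pair in corpus.windows(2) {
        counts[pair_index(pair[0], pair[1])] += 1;
    }
    counts
}

/// Hypothetical weighting: the most common pair gets 0, unseen pairs get
/// u16::MAX, and everything else is scaled by log frequency in between.
fn weights_from_counts(counts: &[u64]) -> Vec<u16> {
    let max = counts.iter().copied().max().unwrap_or(0).max(1) as f64;
    let denom = (1.0 + max).ln();
    counts
        .iter()
        .map(|&c| {
            let commonness = (1.0 + c as f64).ln() / denom; // 0.0 (unseen) to 1.0 (most common)
            ((1.0 - commonness) * u16::MAX as f64).round() as u16
        })
        .collect()
}

fn main() {
    let weights = weights_from_counts(&count_pairs(b"aaaaab"));
    println!("weight(aa) = {}", weights[pair_index(b'a', b'a')]);
    println!("weight(ab) = {}", weights[pair_index(b'a', b'b')]);
    println!("weight(zz) = {}", weights[pair_index(b'z', b'z')]);
}
```

With this toy corpus the frequent pair "aa" ends up with the lowest weight and the unseen pair "zz" with the highest, which is the ordering the section describes.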
Architecture
For the full quantitative analysis (selectivity math, index size estimates, posting list encoding tradeoffs), see docs/ARCHITECTURE.md.
The high-level flow:
Query -> Router -> [Literal | Indexed Regex | Full Scan]
|
Gram extraction
|
Posting list intersection (smallest-first)
|
Candidate file IDs
|
Verifier (memchr or regex against file content)
|
Results
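The two middle stages of this flow, smallest-first posting list intersection followed by verification, can be sketched as follows. This is an illustration under simplifying assumptions, not syntext's implementation: posting lists are plain sorted slices of file IDs, files are held in an in-memory map, and the verifier is a substring check standing in for memchr or regex. The function names and the example grams are hypothetical.

```rust
use std::collections::HashMap;

/// Intersect sorted posting lists, starting from the shortest one so the
/// working set can only shrink from its smallest possible size.
/// An empty input yields no candidates; a real router would fall back to a
/// full scan in that case.
fn intersect_smallest_first(mut lists: Vec<&[u32]>) -> Vec<u32> {
    lists.sort_by_key(|list| list.len());
    let mut iter = lists.into_iter();
    let mut candidates: Vec<u32> = match iter.next() {
        Some(first) => first.to_vec(),
        None => return Vec::new(),
    };
    for list in iter {
        // Each list is sorted, so membership is a binary search.
        candidates.retain(|id| list.binary_search(id).is_ok());
        if candidates.is_empty() {
            break;
        }
    }
    candidates
}

/// Confirm candidates against the actual file content.
fn verify(candidates: &[u32], files: &HashMap<u32, &str>, needle: &str) -> Vec<u32> {
    candidates
        .iter()
        .copied()
        .filter(|id| files.get(id).map_or(false, |content| content.contains(needle)))
        .collect()
}

fn main() {
    let files: HashMap<u32, &str> = HashMap::from([
        (1, "pub fn build_index() {}"),
        (2, "pub fn build() {}"),
        (3, "// index of builds"),
    ]);
    // Hypothetical posting lists for three grams of the query "build_index".
    let bui: &[u32] = &[1, 2, 3];
    let d_i: &[u32] = &[1];
    let ind: &[u32] = &[1, 3];
    let candidates = intersect_smallest_first(vec![bui, d_i, ind]);
    let hits = verify(&candidates, &files, "build_index");
    println!("candidates = {candidates:?}, verified = {hits:?}");
}
```

Starting from the shortest list means the rarest gram sets the upper bound on how many files the verifier ever has to open.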
Three index components feed candidate selection:
- Content index: sparse n-gram posting lists (the core). Trigram augmentation ensures no false negatives for token-aligned queries.
- Path index: Roaring bitmap component sets for path/type filtering.
- Symbol index (optional): Tree-sitter extraction into SQLite.
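As a rough illustration of how sparse n-grams can be cut from text using pair weights, the sketch below applies one common formulation: a substring is emitted as a gram when the weights of its two boundary byte pairs are both strictly greater than the weight of every pair in its interior. Whether syntext uses exactly this rule is an assumption. The toy_weight function is a hypothetical stand-in for the pre-trained table, and trigram augmentation is not shown.

```rust
/// Toy stand-in for the pre-trained table: weight of the byte pair (a, b).
/// Hypothetical values chosen only to make the example easy to follow.
fn toy_weight(a: u8, _b: u8) -> u16 {
    match a {
        b'a' => 9,
        b'b' => 1,
        b'c' => 2,
        _ => 8,
    }
}

/// Emit every substring whose two boundary pair weights both exceed all
/// interior pair weights. Spans of exactly two pairs (three bytes) have no
/// interior and are always emitted.
fn sparse_grams(text: &[u8], weight: impl Fn(u8, u8) -> u16) -> Vec<&[u8]> {
    // w[i] is the weight of the pair (text[i], text[i + 1]).
    let w: Vec<u16> = text.windows(2).map(|p| weight(p[0], p[1])).collect();
    let mut grams = Vec::new();
    for i in 0..w.len() {
        let mut interior_max = 0u16;
        let mut has_interior = false;
        for j in (i + 1)..w.len() {
            if !has_interior || w[i].min(w[j]) > interior_max {
                // Pairs i..=j cover bytes i..=j + 1.
                grams.push(&text[i..j + 2]);
            }
            // w[j] becomes an interior weight for any longer span from i.
            interior_max = if has_interior { interior_max.max(w[j]) } else { w[j] };
            has_interior = true;
            if interior_max >= w[i] {
                break; // no longer span starting at i can qualify
            }
        }
    }
    grams
}

fn main() {
    for gram in sparse_grams(b"abcde", toy_weight) {
        println!("{}", String::from_utf8_lossy(gram));
    }
}
```

Because rare pairs carry high weights, gram boundaries tend to fall on rare pairs and common pairs end up inside longer grams, matching the description in the Weight table section.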
Segments are immutable single-file mmap structures (SNTX format). Updates go through an in-memory overlay with atomic batch commit via ArcSwap.
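The overlay-plus-atomic-commit pattern can be sketched with the standard library alone. This sketch swaps in RwLock<Arc<Snapshot>> for ArcSwap to stay dependency-free, and it stores whole file contents instead of posting lists. The method names mirror the library calls shown under Usage, but their signatures here are assumptions, as are the Snapshot and Index types.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, RwLock};

/// Immutable view of the index that readers hold while searching.
#[derive(Default)]
struct Snapshot {
    files: HashMap<String, String>,
}

/// Hypothetical index: a published snapshot plus a staging area for edits.
struct Index {
    visible: RwLock<Arc<Snapshot>>,
    pending: Mutex<HashMap<String, String>>,
}

impl Index {
    fn new() -> Self {
        Index {
            visible: RwLock::new(Arc::new(Snapshot::default())),
            pending: Mutex::new(HashMap::new()),
        }
    }

    /// Stage a change; it is not visible to searches yet.
    fn notify_change(&self, path: &str, content: &str) {
        self.pending.lock().unwrap().insert(path.to_string(), content.to_string());
    }

    /// Publish all staged changes at once by swapping in a new snapshot.
    fn commit_batch(&self) {
        let staged = std::mem::take(&mut *self.pending.lock().unwrap());
        let mut visible = self.visible.write().unwrap();
        let mut next = visible.files.clone();
        next.extend(staged);
        *visible = Arc::new(Snapshot { files: next });
    }

    /// Readers clone the Arc, so an in-flight search keeps its own view.
    fn snapshot(&self) -> Arc<Snapshot> {
        Arc::clone(&self.visible.read().unwrap())
    }

    /// Return the sorted paths of files whose content contains the needle.
    fn search(&self, needle: &str) -> Vec<String> {
        let snap = self.snapshot();
        let mut hits: Vec<String> = snap
            .files
            .iter()
            .filter(|(_, content)| content.contains(needle))
            .map(|(path, _)| path.clone())
            .collect();
        hits.sort();
        hits
    }
}

fn main() {
    let index = Index::new();
    index.notify_change("a.rs", "fn alpha() {}");
    index.notify_change("b.rs", "fn alpha_beta() {}");
    println!("before commit: {:?}", index.search("alpha"));
    index.commit_batch();
    println!("after commit: {:?}", index.search("alpha"));
}
```

A reader that took a snapshot before the commit continues to see the old state, while new searches see every staged change together, which is what atomic visibility means here.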
Project status
All phases complete (v1.0). The core st index && st "pattern" workflow is functional and validated against ripgrep (search is the default command, so there is no search subcommand). Symbol search is available behind --features symbols.
See specs/001-hybrid-code-search-index/tasks.md for the full implementation plan with 69 tasks across 9 phases.
| Phase | Status | What it delivers |
|---|---|---|
| 1. Setup | Complete | Cargo project, dependencies, module structure |
| 2. Foundational | Complete | Weight table, tokenizer, posting lists, correctness harness |
| 3. US5 -- Build | Complete | Full index build from scratch |
| 4. US1 -- Search | Complete | Literal + regex search, ripgrep correctness validation |
| 5. US2 -- Incremental | Complete | Overlay, batch commit, read-your-writes |
| 6. US3 -- Path scoping | Complete | Path/type filters with Roaring bitmaps |
| 7. US4 -- Symbols | Complete | Tree-sitter symbol extraction, SQLite storage |
| 8. CLI | Complete | st binary with grep-compatible output |
| 9. Polish | Complete | Bug fixes, security hardening, benchmarks, documentation |
Known limitations
- Crash recovery: Overlay state is lost on unclean shutdown. Run st update or st index after a crash.
- Invert match scope: st -v inverts within candidate files only, not the full corpus.
- Non-aligned substring coverage: ~16% false-negative rate for queries that don't align with token boundaries. Token-aligned queries (identifiers, keywords) have 0% false negatives.
- Network filesystems: Index directory must be on local filesystem. NFS/SMB behavior is undefined.
- Case-insensitive overhead: ~15-20% more candidates due to lowercase normalization. Correct results guaranteed by verifier.
- \r-only line endings: Treated as a single line (matches ripgrep behavior).
- Symbol search accuracy: Tier 3 (heuristic) results are approximate. Tree-sitter failures fall back silently.
Design documents
- docs/ARCHITECTURE.md -- Quantitative analysis: selectivity math, index size estimates, posting list encoding, design tradeoffs

Detailed specs in specs/001-hybrid-code-search-index/:
- spec.md -- Feature specification with user stories and acceptance criteria
- research.md -- 19-section architecture research covering every subsystem
- data-model.md -- Entity definitions and relationships
- contracts/ -- Library API, CLI, and segment format contracts
- tasks.md -- Implementation plan with dependency graph
License
MIT