siftdb-core 0.2.0

High-performance grep-native database for code and text collections with regex support

SiftDB

Grep-Native, Agent-Oriented Database for Code and Text Collections

SiftDB is a purpose-built database for storing and searching codebases and text corpora. It is append-only, mmap-friendly, and optimized for fast grep-style substring and regex queries that return precise (file, line, text) citations.

🚀 Quick Start

Prerequisites

  • A Rust toolchain with cargo (used by the build step below)

Installation

# Clone the repository
git clone https://github.com/your-org/siftdb
cd siftdb

# Build the project
cargo build --release

# The sift binary will be available at target/release/sift

Basic Usage

# 1. Initialize a new collection
./target/release/sift init my-project.sift

# 2. Import files from a directory
./target/release/sift import my-project.sift --from /path/to/source \
  --include "**/*.rs" --include "**/*.py" --include "**/*.js"

# 3. Search for code patterns
./target/release/sift find my-project.sift "async fn"
./target/release/sift find my-project.sift "DATABASE_URL" --limit 20

# 4. View specific files
./target/release/sift open my-project.sift --file src/main.rs --start-line 1 --end-line 50

# 5. Run performance benchmarks
./target/release/sift benchmark bench.sift --source /path/to/large/codebase

Example Session

# Initialize and import a Rust project
$ sift init rust-project.sift
✓ Collection initialized successfully

$ sift import rust-project.sift --from ./my-rust-app --include "**/*.rs"
Ingestion complete: 245 files ingested, 12 skipped, 0 errors

$ sift find rust-project.sift "println!"
Found 89 matches for: println!
src/main.rs:42: println!("Starting server on port {}", port);
src/lib.rs:18: println!("Debug: user={:?}", user);
...

$ sift find rust-project.sift "struct User" --format json
[{"path": "src/models.rs", "line": 15, "text": "pub struct User {"}]
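
For agents that consume the plain-text output rather than JSON, the `path:line: text` rows shown in the session can be split back into a citation. The following is a minimal sketch, assuming that row shape holds for every hit; the `Citation` struct and `parse_citation` function are hypothetical names and are not part of siftdb-core.

```rust
/// A single search hit, mirroring the (file, line, text) citation shape.
/// Hypothetical type for illustration only.
#[derive(Debug, PartialEq)]
pub struct Citation {
    pub path: String,
    pub line: u32,
    pub text: String,
}

/// Parse one `path:line: text` row.
/// Returns `None` for rows that do not fit, such as the "Found N matches" header.
pub fn parse_citation(row: &str) -> Option<Citation> {
    // Split into at most three parts so colons inside the matched text survive.
    let mut parts = row.splitn(3, ':');
    let path = parts.next()?;
    let line = parts.next()?.trim().parse::<u32>().ok()?;
    let text = parts.next()?;
    Some(Citation {
        path: path.to_string(),
        line,
        // Drop the single separator space that follows the line number.
        text: text.strip_prefix(' ').unwrap_or(text).to_string(),
    })
}
```

Paths that themselves contain a colon would be misparsed by this approach, which is one reason to prefer `--format json` for programmatic use.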

Features

  • Append-only storage with CRC integrity checks
  • Fast substring search with file:line citations
  • Agent-first API designed for programmatic use
  • Crash-safe with atomic manifest updates
  • Lightweight - embeddable, local-first design
  • Language-aware file classification
  • Custom .sift format for segment files

Architecture

SiftDB uses an append-only storage model with immutable indexes:

<collection>.sift/
├── MANIFEST.a              # Current epoch and file list
├── store/
│   ├── seg-000001.sift     # Append-only frames with content
│   └── seg-000002.sift     # Additional segments
├── index/
│   ├── path.json           # Path → file handle mapping
│   └── handles.json        # Handle → segment metadata
└── tmp/                    # Staging area for atomic swaps
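
The layout above can be addressed with a couple of path helpers. This is a minimal sketch that assumes segment numbers are zero-padded to six digits, as the `seg-000001.sift` names suggest; `segment_path` and `manifest_path` are hypothetical helpers, not functions from siftdb-core.

```rust
use std::path::{Path, PathBuf};

/// Path of the n-th segment file under `<collection>.sift/store/`.
/// Assumption: six-digit zero padding, inferred from the names in the tree.
pub fn segment_path(root: &Path, n: u32) -> PathBuf {
    root.join("store").join(format!("seg-{:06}.sift", n))
}

/// Path of the manifest at the collection root.
pub fn manifest_path(root: &Path) -> PathBuf {
    root.join("MANIFEST.a")
}
```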

Frame Format

Each file is stored as a frame with the following structure:

u32   frame_len
u32   header_crc32
FileHeader (64 bytes)       # Magic, version, lang, lengths, file_id
u32   content_crc32
u8[]  content              # File content
u8[]  line_table           # Delta-encoded newline positions
u32   frame_crc32
(padded to 4KB alignment)
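
Two parts of this layout lend themselves to a short illustration: rounding a frame up to 4KB alignment, and delta-encoding newline positions so a byte offset can be mapped to a line number. The sketch below assumes the line table stores the gap between consecutive newline offsets; the actual on-disk integer encoding is not specified here, and all function names are hypothetical.

```rust
const FRAME_ALIGN: usize = 4096;

/// Round a frame length up to the next multiple of 4KB.
pub fn padded_len(len: usize) -> usize {
    (len + FRAME_ALIGN - 1) / FRAME_ALIGN * FRAME_ALIGN
}

/// Byte offsets of every '\n' in `content`.
pub fn newline_positions(content: &[u8]) -> Vec<u32> {
    let mut out = Vec::new();
    for (i, &b) in content.iter().enumerate() {
        if b == b'\n' {
            out.push(i as u32);
        }
    }
    out
}

/// Store each position as the distance from the previous one.
pub fn delta_encode(positions: &[u32]) -> Vec<u32> {
    let mut prev = 0u32;
    positions
        .iter()
        .map(|&p| {
            let d = p - prev;
            prev = p;
            d
        })
        .collect()
}

/// Rebuild absolute positions with a running sum.
pub fn delta_decode(deltas: &[u32]) -> Vec<u32> {
    let mut acc = 0u32;
    deltas
        .iter()
        .map(|&d| {
            acc += d;
            acc
        })
        .collect()
}

/// 1-based line number of the byte at `offset`, given sorted newline positions.
pub fn line_of(positions: &[u32], offset: u32) -> usize {
    // Count the newlines strictly before the offset, then add one.
    positions.partition_point(|&p| p < offset) + 1
}
```

Deltas tend to be small numbers (typical line lengths), which is what makes them compress well under a variable-length integer encoding.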

Installation

From crates.io (Recommended)

# Install the CLI tool
cargo install siftdb-cli

# Or add the core library to your project
cargo add siftdb-core

From source

# Clone and build
git clone https://github.com/siftdb/siftdb
cd siftdb
cargo build --release

# The sift binary will be available at target/release/sift

Usage

Initialize a collection

sift init myproject.sift

Import files from filesystem

# Import all files
sift import myproject.sift --from /path/to/source

# Import with filters
sift import myproject.sift --from /path/to/source \
  --include "**/*.rs" --include "**/*.py" \
  --exclude "**/target/**" --exclude "**/.git/**"

Search for text

# Basic substring search
sift find myproject.sift "DATABASE_URL"

# Search with path filtering
sift find myproject.sift "async fn" --path-glob "**/*.rs" --limit 50

# JSON output for programmatic use
sift find myproject.sift "TODO" --format json

View file content

# View entire file
sift open myproject.sift --file src/main.rs

# View specific line range
sift open myproject.sift --file src/lib.rs --start-line 10 --end-line 50

API Design

SiftDB is designed with agents in mind. Core operations return structured citations:

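use siftdb_core::SiftDB;

// Assumes the crate's error type implements std::error::Error so that `?` works here.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open an existing collection and take a snapshot
    let db = SiftDB::open("project.sift")?;
    let snapshot = db.snapshot()?;

    // Find matches with precise citations
    let hits = snapshot.find("async fn", Some("**/*.rs"), Some(100))?;
    for hit in hits {
        println!("{}:{}: {}", hit.path, hit.line, hit.text);
    }

    // Open file spans
    let span = snapshot.open_span("src/main.rs", 1, 50)?;
    println!("{}", span.content);

    Ok(())
}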

Performance Characteristics

Figures reported by the project's own benchmarks:

  • Search Speed: 69,000+ queries/sec (substring search)
  • Import: Fast file ingestion with CRC validation
  • Memory Efficient: Append-only storage with minimal overhead
  • Scale: Multi-GB to multi-TB collections supported

Benchmark Results

Run your own benchmarks:

sift benchmark my-collection.sift --source /path/to/large/codebase

Sample output:

🎯 Key Performance Metrics:
- Import Rate: 1,250 files/sec
- Import Throughput: 45.2 MB/s  
- Average Search Rate: 69,832 queries/sec

Note: The sample figures above measure basic substring search. As of Milestone 0.2, regex search is also available through the sift regex command, backed by trigram indexes (see Development Status).

Storage Guarantees

  • Crash safety: CRC checks on all frames, atomic manifest updates
  • Consistency: Single-writer/multi-reader with epoch-based snapshots
  • Durability: All data flushed to disk before commit
  • Recovery: Automatic tail repair and index rebuilding
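
A common way to get an atomic manifest update is to stage the new file, flush it, and rename it over the old one. The sketch below shows that general technique using only the standard library. It is not taken from SiftDB's source; the staged file name `MANIFEST.a.new` and the function name are hypothetical.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Replace `<root>/MANIFEST.a` so readers see either the old or the new file.
pub fn write_manifest_atomically(root: &Path, bytes: &[u8]) -> io::Result<()> {
    // Stage the new manifest in tmp/, which lives on the same filesystem.
    let tmp_dir = root.join("tmp");
    fs::create_dir_all(&tmp_dir)?;
    let staged = tmp_dir.join("MANIFEST.a.new"); // hypothetical staging name

    let mut file = File::create(&staged)?;
    file.write_all(bytes)?;
    // Flush file contents and metadata to disk before publishing.
    file.sync_all()?;

    // Publish the staged file by renaming it over the old manifest.
    fs::rename(&staged, root.join("MANIFEST.a"))?;
    Ok(())
}
```

On POSIX systems the rename is atomic when source and target are on the same filesystem. For full durability the containing directory usually needs to be synced as well, which this sketch leaves out.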

Development Status

Current: Milestone 0.2 (Advanced Search) - COMPLETED

  • ✅ Segment writer with CRCs and frame format
  • ✅ Path mapping and handle management
  • ✅ CLI: init, import, find, open, regex, compare
  • ✅ Basic substring search with line citations
  • ✅ Path FST indexes for fast path filtering
  • ✅ Trigram indexes for regex acceleration
  • ✅ Full regex search via sift regex command
  • ✅ Performance benchmarking with regression detection
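
As an illustration of how trigram indexes can accelerate regex search in general, the sketch below indexes every 3-byte window of each document and intersects posting lists for a literal that any match must contain. It shows the general technique, not SiftDB's actual index format, and `TrigramIndex` and its methods are hypothetical names.

```rust
use std::collections::{BTreeSet, HashMap};

/// All distinct 3-byte windows of `s`.
pub fn trigrams(s: &str) -> BTreeSet<[u8; 3]> {
    s.as_bytes().windows(3).map(|w| [w[0], w[1], w[2]]).collect()
}

/// Maps each trigram to the set of document ids that contain it.
pub struct TrigramIndex {
    postings: HashMap<[u8; 3], BTreeSet<usize>>,
    doc_count: usize,
}

impl TrigramIndex {
    pub fn build(docs: &[&str]) -> Self {
        let mut postings: HashMap<[u8; 3], BTreeSet<usize>> = HashMap::new();
        for (id, doc) in docs.iter().enumerate() {
            for gram in trigrams(doc) {
                postings.entry(gram).or_default().insert(id);
            }
        }
        TrigramIndex { postings, doc_count: docs.len() }
    }

    /// Documents that may contain `literal`. Candidates still need to be
    /// checked with the real matcher, since trigram overlap is only necessary.
    pub fn candidates(&self, literal: &str) -> BTreeSet<usize> {
        let mut result: Option<BTreeSet<usize>> = None;
        for gram in trigrams(literal) {
            let ids = self.postings.get(&gram).cloned().unwrap_or_default();
            result = Some(match result {
                None => ids,
                Some(acc) => acc.intersection(&ids).copied().collect(),
            });
        }
        // Literals shorter than three bytes cannot narrow the search.
        result.unwrap_or_else(|| (0..self.doc_count).collect())
    }
}
```

For a pattern such as `struct.*User`, the required literals `struct` and `User` can each be looked up and their candidate sets intersected, so the regex engine only runs on the few documents that survive.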

Performance Improvements in v0.2:

  • +16.67% faster queries (90K → 105K QPS)
  • +17% better throughput (21.2 → 24.8 MB/s)
  • 13.6% lower response times

Future Milestones

  • Incremental updates and tombstones
  • SWMR locking and concurrency
  • HTTP API server
  • Python/other language bindings
  • Compaction and bloom filters

Contributing

SiftDB is designed to be the "SQLite of code search" - a reliable, embeddable database optimized for grep-like queries.

Key design principles:

  • Agent-first: APIs return structured citations, not just text
  • Append-only: Immutable storage with indexes as caches
  • Citation-precise: Every result includes file:line information
  • Crash-safe: CRCs and atomic updates throughout

License

MIT License - see LICENSE file for details.