SiftDB
Grep-Native, Agent-Oriented Database for Code and Text Collections
SiftDB is a purpose-built database for storing and searching codebases and text corpora. It is append-only, mmap-friendly, and optimized for fast grep-style, regex, and substring queries that return precise (file, line, text) citations.
🚀 Quick Start
Prerequisites
- Rust 1.70+ (install from rustup.rs)
Installation
# Clone the repository
# Build the project
# The sift binary will be available at target/release/sift
Basic Usage
# 1. Initialize a new collection
# 2. Import files from a directory
# 3. Search for code patterns
# 4. View specific files
# 5. Run performance benchmarks
Example Session
# Initialize and import a Rust project
Features
- Append-only storage with CRC integrity checks
- Fast substring search with file:line citations
- Agent-first API designed for programmatic use
- Crash-safe with atomic manifest updates
- Lightweight - embeddable, local-first design
- Language-aware file classification
- Custom .sift format for segment files
Architecture
SiftDB uses an append-only storage model with immutable indexes:
<collection>.sift/
├── MANIFEST.a # Current epoch and file list
├── store/
│ ├── seg-000001.sift # Append-only frames with content
│ └── seg-000002.sift # Additional segments
├── index/
│ ├── path.json # Path → file handle mapping
│ └── handles.json # Handle → segment metadata
└── tmp/ # Staging area for atomic swaps
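The `tmp/` staging area suggests the usual write-then-rename pattern for atomic swaps. The sketch below shows that pattern using only the standard library. The function name, the staged file name, and the idea that the manifest is an opaque byte blob are assumptions for illustration, not SiftDB's actual code.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Atomically replace `<root>/MANIFEST.a` with `contents`.
/// Assumption: `tmp/` lives on the same filesystem as the collection root,
/// so the final `rename` is an atomic swap.
pub fn write_manifest_atomically(root: &Path, contents: &[u8]) -> io::Result<()> {
    let tmp_dir = root.join("tmp");
    fs::create_dir_all(&tmp_dir)?;

    // 1. Write the new manifest into the staging area.
    let staged = tmp_dir.join("MANIFEST.a.staged"); // hypothetical file name
    let mut file = File::create(&staged)?;
    file.write_all(contents)?;

    // 2. Flush file data to disk before the new manifest becomes visible.
    file.sync_all()?;
    drop(file);

    // 3. Swap it into place: readers see either the old or the new manifest,
    //    never a half-written one.
    fs::rename(&staged, root.join("MANIFEST.a"))?;

    // 4. Best effort: sync the directory so the rename itself survives a crash.
    //    Opening a directory as a file works on Unix; failures are ignored.
    if let Ok(dir) = File::open(root) {
        let _ = dir.sync_all();
    }
    Ok(())
}
```

A crash before step 3 leaves only a stale file in `tmp/`, which can be deleted on the next open.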
Frame Format
Each file is stored as a frame with the following structure:
u32 frame_len
u32 header_crc32
FileHeader (64 bytes) # Magic, version, lang, lengths, file_id
u32 content_crc32
u8[] content # File content
u8[] line_table # Delta-encoded newline positions
u32 frame_crc32
(padded to 4KB alignment)
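As a rough illustration of the layout above, the following sketch assembles one frame in memory. Several details are assumptions, since the listing does not specify them: little-endian integers, a line table stored as `u32` deltas, `frame_len` counting every byte after itself through `frame_crc32`, and `frame_crc32` covering everything before it. The 64-byte header is treated as an opaque array.

```rust
pub const HEADER_LEN: usize = 64;
pub const ALIGN: usize = 4096;

/// Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320).
pub fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // All ones if the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Delta-encode newline byte offsets: the first entry is the offset of the
/// first newline, each later entry is the gap to the previous newline.
pub fn delta_encode_newlines(content: &[u8]) -> Vec<u32> {
    let mut deltas = Vec::new();
    let mut prev = 0u32;
    for (i, &b) in content.iter().enumerate() {
        if b == b'\n' {
            let pos = i as u32;
            deltas.push(pos - prev);
            prev = pos;
        }
    }
    deltas
}

/// Build one frame and pad it with zeros to the next 4 KB boundary.
pub fn encode_frame(header: &[u8; HEADER_LEN], content: &[u8]) -> Vec<u8> {
    let line_table: Vec<u8> = delta_encode_newlines(content)
        .iter()
        .flat_map(|d| d.to_le_bytes())
        .collect();

    let mut body = Vec::new();
    body.extend_from_slice(&crc32(header).to_le_bytes()); // header_crc32
    body.extend_from_slice(header); // FileHeader
    body.extend_from_slice(&crc32(content).to_le_bytes()); // content_crc32
    body.extend_from_slice(content); // content
    body.extend_from_slice(&line_table); // line_table

    // Assumed meaning of frame_len: body plus the trailing frame_crc32.
    let frame_len = (body.len() + 4) as u32;
    let mut frame = Vec::new();
    frame.extend_from_slice(&frame_len.to_le_bytes());
    frame.extend_from_slice(&body);
    let frame_crc = crc32(&frame); // assumed to cover all preceding bytes
    frame.extend_from_slice(&frame_crc.to_le_bytes());

    let padded = (frame.len() + ALIGN - 1) / ALIGN * ALIGN;
    frame.resize(padded, 0);
    frame
}
```

Delta encoding keeps the line table small for typical source files, because gaps between newlines are short even when absolute offsets are large.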
Installation
From crates.io (Recommended)
# Install the CLI tool
# Or add the core library to your project
From source
# Clone and build
# The sift binary will be available at target/release/sift
Usage
Initialize a collection
Import files from filesystem
# Import all files
# Import with filters
Search for text
# Basic substring search
# Search with path filtering
# JSON output for programmatic use
View file content
# View entire file
# View specific line range
API Design
SiftDB is designed with agents in mind. Core operations return structured citations:
use SiftDB;
let db = open?;
let snapshot = db.snapshot?;
// Find matches with precise citations
let hits = snapshot.find?;
for hit in hits
// Open file spans
let span = snapshot.open_span?;
println!;
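To make the idea of a structured citation concrete, here is a self-contained sketch of a substring search that returns (file, line, text) triples. It is not the SiftDB API: the `Hit` struct, its field names, and the in-memory `find` function are hypothetical stand-ins that only mirror the shape of result described above.

```rust
/// One search result: where the match is and what the line says.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Hit {
    pub path: String,
    pub line: usize, // 1-based line number
    pub text: String,
}

/// Scan `(path, content)` pairs and return a citation for every line that
/// contains `needle`. Results keep the input order of files and lines.
pub fn find(files: &[(&str, &str)], needle: &str) -> Vec<Hit> {
    let mut hits = Vec::new();
    for (path, content) in files {
        for (idx, line) in content.lines().enumerate() {
            if line.contains(needle) {
                hits.push(Hit {
                    path: path.to_string(),
                    line: idx + 1,
                    text: line.to_string(),
                });
            }
        }
    }
    hits
}
```

An agent consuming these hits can quote `path:line` directly without re-reading the file to locate the match.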
Performance Characteristics
SiftDB's performance profile, in brief:
- Search Speed: 69,000+ queries/sec (substring search)
- Import: Fast file ingestion with CRC validation
- Memory Efficient: Append-only storage with minimal overhead
- Scale: Multi-GB to multi-TB collections supported
Benchmark Results
Run your own benchmarks:
Sample output:
🎯 Key Performance Metrics:
- Import Rate: 1,250 files/sec
- Import Throughput: 45.2 MB/s
- Average Search Rate: 69,832 queries/sec
Note: The sample figures above measure basic substring search. Regex search and trigram indexes are listed under Milestone 0.2 in Development Status.
Storage Guarantees
- Crash safety: CRC checks on all frames, atomic manifest updates
- Consistency: Single-writer/multi-reader with epoch-based snapshots
- Durability: All data flushed to disk before commit
- Recovery: Automatic tail repair and index rebuilding
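One common way to get single-writer/multi-reader behaviour with epoch-based snapshots is to publish each epoch as an immutable, reference-counted value. The sketch below shows that approach. The `Collection` and `Snapshot` types and their fields are hypothetical, and whether SiftDB works this way internally is an assumption.

```rust
use std::sync::{Arc, RwLock};

/// An immutable view of the collection at one epoch.
#[derive(Debug)]
pub struct Snapshot {
    pub epoch: u64,
    pub files: Vec<String>,
}

/// Holds the currently published snapshot.
pub struct Collection {
    current: RwLock<Arc<Snapshot>>,
}

impl Collection {
    pub fn new() -> Self {
        Collection {
            current: RwLock::new(Arc::new(Snapshot { epoch: 0, files: Vec::new() })),
        }
    }

    /// Readers clone the Arc: cheap, and the view never changes afterwards.
    pub fn snapshot(&self) -> Arc<Snapshot> {
        Arc::clone(&self.current.read().unwrap())
    }

    /// The writer builds the next epoch and publishes it in one step.
    /// Returns the new epoch number.
    pub fn commit(&self, new_files: &[&str]) -> u64 {
        let mut guard = self.current.write().unwrap();
        let mut files = guard.files.clone();
        files.extend(new_files.iter().map(|f| f.to_string()));
        let next = Snapshot { epoch: guard.epoch + 1, files };
        let epoch = next.epoch;
        *guard = Arc::new(next);
        epoch
    }
}
```

A reader that took its snapshot before a commit keeps seeing the old epoch until it asks for a new one, which is what makes long-running searches consistent.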
Development Status
Current: Milestone 0.2 (Advanced Search) - COMPLETED
- ✅ Segment writer with CRCs and frame format
- ✅ Path mapping and handle management
- ✅ CLI: init, import, find, open, regex, compare
- ✅ Basic substring search with line citations
- ✅ Path FST indexes for fast path filtering
- ✅ Trigram indexes for regex acceleration
- ✅ Full regex search via sift regex command
- ✅ Performance benchmarking with regression detection
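For readers unfamiliar with trigram acceleration, the sketch below shows the general technique: index every 3-byte window of each document, intersect the posting lists for the trigrams of a required literal, then verify only the surviving candidates. It is a generic illustration, not SiftDB's index format. The `TrigramIndex` type is hypothetical, and a plain `contains` check stands in for the regex engine that would run on the candidates.

```rust
use std::collections::{BTreeSet, HashMap};

pub struct TrigramIndex {
    postings: HashMap<[u8; 3], BTreeSet<usize>>, // trigram -> document ids
    docs: Vec<String>,
}

impl TrigramIndex {
    pub fn build(docs: &[&str]) -> Self {
        let mut postings: HashMap<[u8; 3], BTreeSet<usize>> = HashMap::new();
        for (id, doc) in docs.iter().enumerate() {
            for w in doc.as_bytes().windows(3) {
                postings.entry([w[0], w[1], w[2]]).or_default().insert(id);
            }
        }
        TrigramIndex { postings, docs: docs.iter().map(|d| d.to_string()).collect() }
    }

    /// Documents that contain every trigram of `literal`.
    /// Returns None when the literal is shorter than 3 bytes, in which case
    /// the index cannot narrow the search.
    pub fn candidates(&self, literal: &str) -> Option<BTreeSet<usize>> {
        let mut result: Option<BTreeSet<usize>> = None;
        for w in literal.as_bytes().windows(3) {
            let ids = self.postings.get(&[w[0], w[1], w[2]]).cloned().unwrap_or_default();
            result = Some(match result {
                None => ids,
                Some(acc) => acc.intersection(&ids).copied().collect(),
            });
        }
        result
    }

    /// Verify candidates (or every document, if the index cannot help).
    pub fn search(&self, literal: &str) -> Vec<usize> {
        let verify = |id: &usize| self.docs[*id].contains(literal);
        match self.candidates(literal) {
            Some(ids) => ids.into_iter().filter(verify).collect(),
            None => (0..self.docs.len()).filter(verify).collect(),
        }
    }
}
```

The verification step matters because sharing all trigrams does not guarantee a match: the trigrams may occur in a different order or in separate places.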
Performance Improvements in v0.2:
- +16.67% faster queries (90K → 105K QPS)
- +17% better throughput (21.2 → 24.8 MB/s)
- 13.6% lower response times
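The first two percentages can be reproduced from the before and after values in parentheses. The helper below is a hypothetical convenience for checking them, not part of SiftDB. The response-time figure has no raw values listed, so it cannot be checked the same way.

```rust
/// Relative change from `before` to `after`, in percent.
pub fn percent_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}
```

For example, `percent_change(90_000.0, 105_000.0)` gives about 16.67, and `percent_change(21.2, 24.8)` gives about 16.98, which the list rounds to 17%.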
Future Milestones
- Incremental updates and tombstones
- SWMR locking and concurrency
- HTTP API server
- Python/other language bindings
- Compaction and bloom filters
Contributing
SiftDB is designed to be the "SQLite of code search" - a reliable, embeddable database optimized for grep-like queries.
Key design principles:
- Agent-first: APIs return structured citations, not just text
- Append-only: Immutable storage with indexes as caches
- Citation-precise: Every result includes file:line information
- Crash-safe: CRCs and atomic updates throughout
License
MIT License - see LICENSE file for details.