siftdb-core 0.2.0

# SiftDB

**Grep-Native, Agent-Oriented Database for Code and Text Collections**

SiftDB is a purpose-built database for storing and searching codebases and text corpora. It is append-only, mmap-friendly, and optimized for fast grep-style substring and regex queries that return precise (file, line, text) citations.

## 🚀 Quick Start

### Prerequisites
- Rust 1.70+ (install from [rustup.rs](https://rustup.rs/))

### Installation

```bash
# Clone the repository
git clone https://github.com/siftdb/siftdb
cd siftdb

# Build the project
cargo build --release

# The sift binary will be available at target/release/sift
```

### Basic Usage

```bash
# 1. Initialize a new collection
./target/release/sift init my-project.sift

# 2. Import files from a directory
./target/release/sift import my-project.sift --from /path/to/source \
  --include "**/*.rs" --include "**/*.py" --include "**/*.js"

# 3. Search for code patterns
./target/release/sift find my-project.sift "async fn"
./target/release/sift find my-project.sift "DATABASE_URL" --limit 20

# 4. View specific files
./target/release/sift open my-project.sift --file src/main.rs --start-line 1 --end-line 50

# 5. Run performance benchmarks
./target/release/sift benchmark bench.sift --source /path/to/large/codebase
```

### Example Session

```bash
# Initialize and import a Rust project
$ sift init rust-project.sift
✓ Collection initialized successfully

$ sift import rust-project.sift --from ./my-rust-app --include "**/*.rs"
Ingestion complete: 245 files ingested, 12 skipped, 0 errors

$ sift find rust-project.sift "println!"
Found 89 matches for: println!
src/main.rs:42: println!("Starting server on port {}", port);
src/lib.rs:18: println!("Debug: user={:?}", user);
...

$ sift find rust-project.sift "struct User" --format json
[{"path": "src/models.rs", "line": 15, "text": "pub struct User {"}]
```

## Features

- **Append-only storage** with CRC integrity checks
- **Fast substring search** with file:line citations  
- **Agent-first API** designed for programmatic use
- **Crash-safe** with atomic manifest updates
- **Lightweight** - embeddable, local-first design
- **Language-aware** file classification
- **Custom .sift format** for segment files

## Architecture

SiftDB uses an append-only storage model with immutable indexes:

```
<collection>.sift/
├── MANIFEST.a              # Current epoch and file list
├── store/
│   ├── seg-000001.sift     # Append-only frames with content
│   └── seg-000002.sift     # Additional segments
├── index/
│   ├── path.json           # Path → file handle mapping
│   └── handles.json        # Handle → segment metadata
└── tmp/                    # Staging area for atomic swaps
```
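The `tmp/` staging directory suggests the familiar write-then-rename pattern for manifest updates. The following is a minimal sketch of that pattern using only the standard library, not SiftDB's actual code: the function name `write_manifest_atomically`, the staging file name, and the manifest contents are all hypothetical.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical helper: replace `<collection>/MANIFEST.a` by staging the new
/// contents in `<collection>/tmp/` and renaming the staged file into place.
fn write_manifest_atomically(collection: &Path, contents: &[u8]) -> io::Result<()> {
    let tmp_dir = collection.join("tmp");
    fs::create_dir_all(&tmp_dir)?;

    // 1. Write the new manifest to a staging file and flush it to disk.
    let staged = tmp_dir.join("MANIFEST.a.new"); // assumed staging name
    let mut file = File::create(&staged)?;
    file.write_all(contents)?;
    file.sync_all()?;

    // 2. Rename over the live manifest. On POSIX filesystems a rename within
    //    one filesystem is atomic: readers see the old file or the new one.
    fs::rename(&staged, collection.join("MANIFEST.a"))?;
    Ok(())
}

fn main() -> io::Result<()> {
    let collection = std::env::temp_dir().join("siftdb-sketch.sift");
    write_manifest_atomically(&collection, b"epoch=1\n")?;
    write_manifest_atomically(&collection, b"epoch=2\n")?;
    let current = fs::read_to_string(collection.join("MANIFEST.a"))?;
    print!("{current}");
    Ok(())
}
```

Because the staged file lives inside the collection directory, the rename stays on one filesystem, which is what makes the swap atomic.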

### Frame Format

Each file is stored as a frame with the following structure:

```
u32   frame_len
u32   header_crc32
FileHeader (64 bytes)       # Magic, version, lang, lengths, file_id
u32   content_crc32
u8[]  content              # File content
u8[]  line_table           # Delta-encoded newline positions
u32   frame_crc32
(padded to 4KB alignment)
```
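To make the `line_table` and padding fields concrete, here is a small sketch of delta-encoding newline positions and rounding a frame up to 4 KB. It is an illustration under assumptions, not the on-disk format: the layout above does not say how deltas are serialized, so this sketch stores them as plain `u32` values, and the names `encode_line_table`, `line_at`, and `padded_len` are hypothetical.

```rust
/// Frame alignment taken from the layout above (4 KB).
const FRAME_ALIGN: usize = 4096;

/// Round a frame length up to the next multiple of 4 KB.
fn padded_len(len: usize) -> usize {
    (len + FRAME_ALIGN - 1) / FRAME_ALIGN * FRAME_ALIGN
}

/// Delta-encode the byte offset of every '\n' in `content`.
/// Each entry is the distance from the previous newline (or from offset 0).
/// Assumption: deltas are kept as u32; the real encoding is unspecified.
fn encode_line_table(content: &[u8]) -> Vec<u32> {
    let mut table = Vec::new();
    let mut prev = 0usize;
    for (pos, &byte) in content.iter().enumerate() {
        if byte == b'\n' {
            table.push((pos - prev) as u32);
            prev = pos;
        }
    }
    table
}

/// Return the 1-based `line` of `content`, without its trailing newline.
fn line_at<'a>(content: &'a [u8], table: &[u32], line: usize) -> Option<&'a [u8]> {
    if line == 0 {
        return None;
    }
    // Rebuild absolute newline offsets by summing the deltas.
    let mut newlines = Vec::with_capacity(table.len());
    let mut acc = 0usize;
    for &delta in table {
        acc += delta as usize;
        newlines.push(acc);
    }
    let start = if line == 1 { 0 } else { *newlines.get(line - 2)? + 1 };
    if line > 1 && start >= content.len() {
        return None; // nothing after the final newline
    }
    let end = newlines.get(line - 1).copied().unwrap_or(content.len());
    Some(&content[start..end])
}

fn main() {
    let content = b"a\nbb\nccc";
    let table = encode_line_table(content);
    let line2 = line_at(content, &table, 2).unwrap();
    println!("table = {:?}, line 2 = {}", table, String::from_utf8_lossy(line2));
    println!("a 5000-byte frame occupies {} bytes", padded_len(5000));
}
```

Storing deltas rather than absolute offsets keeps the numbers small, which is what makes a compact variable-length encoding worthwhile in a real format.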

## Installation

### From crates.io (Recommended)

```bash
# Install the CLI tool
cargo install siftdb-cli

# Or add the core library to your project
cargo add siftdb-core
```

### From source

```bash
# Clone and build
git clone https://github.com/siftdb/siftdb
cd siftdb
cargo build --release

# The sift binary will be available at target/release/sift
```

## Usage

### Initialize a collection

```bash
sift init myproject.sift
```

### Import files from filesystem

```bash
# Import all files
sift import myproject.sift --from /path/to/source

# Import with filters
sift import myproject.sift --from /path/to/source \
  --include "**/*.rs" --include "**/*.py" \
  --exclude "**/target/**" --exclude "**/.git/**"
```

### Search for text

```bash
# Basic substring search
sift find myproject.sift "DATABASE_URL"

# Search with path filtering
sift find myproject.sift "async fn" --path-glob "**/*.rs" --limit 50

# JSON output for programmatic use
sift find myproject.sift "TODO" --format json
```

### View file content

```bash
# View entire file
sift open myproject.sift --file src/main.rs

# View specific line range
sift open myproject.sift --file src/lib.rs --start-line 10 --end-line 50
```

## API Design

SiftDB is designed with agents in mind. Core operations return structured citations:

```rust
use siftdb_core::SiftDB;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open an existing collection and take a read snapshot
    let db = SiftDB::open("project.sift")?;
    let snapshot = db.snapshot()?;

    // Find matches with precise citations
    let hits = snapshot.find("async fn", Some("**/*.rs"), Some(100))?;
    for hit in hits {
        println!("{}:{}: {}", hit.path, hit.line, hit.text);
    }

    // Open file spans
    let span = snapshot.open_span("src/main.rs", 1, 50)?;
    println!("{}", span.content);

    Ok(())
}
```
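To illustrate what a "structured citation" carries, here is a self-contained sketch of a naive substring scan that returns (path, line, text) records. It does not use `siftdb_core`; the `Hit` struct and the free function `find` below are hypothetical stand-ins that only mirror the field names used above.

```rust
/// One search hit: the (file, line, text) citation.
#[derive(Debug, Clone, PartialEq)]
struct Hit {
    path: String,
    line: usize, // 1-based line number
    text: String,
}

/// Naive substring scan over in-memory (path, content) pairs.
/// Stops as soon as `limit` hits have been collected.
fn find(files: &[(&str, &str)], needle: &str, limit: Option<usize>) -> Vec<Hit> {
    let max = limit.unwrap_or(usize::MAX);
    let mut hits = Vec::new();
    for (path, content) in files {
        for (idx, text) in content.lines().enumerate() {
            if hits.len() >= max {
                return hits;
            }
            if text.contains(needle) {
                hits.push(Hit {
                    path: path.to_string(),
                    line: idx + 1,
                    text: text.to_string(),
                });
            }
        }
    }
    hits
}

fn main() {
    let files = [
        ("src/main.rs", "fn main() {\n    run();\n}\nasync fn run() {}\n"),
        ("src/lib.rs", "pub async fn serve() {}\n"),
    ];
    for hit in find(&files, "async fn", Some(100)) {
        println!("{}:{}: {}", hit.path, hit.line, hit.text);
    }
}
```

Returning a record per match, rather than preformatted text, lets a caller re-open the exact line range later or serialize the hit as JSON.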

## Performance Characteristics

Headline characteristics (the search rate matches the sample benchmark output below):

- **Search Speed**: 69,000+ queries/sec (substring search)
- **Import**: Fast file ingestion with CRC validation
- **Memory Efficient**: Append-only storage with minimal overhead
- **Scale**: Multi-GB to multi-TB collections supported

### Benchmark Results

Run your own benchmarks:
```bash
sift benchmark my-collection.sift --source /path/to/large/codebase
```

Sample output:
```
🎯 Key Performance Metrics:
- Import Rate: 1,250 files/sec
- Import Throughput: 45.2 MB/s  
- Average Search Rate: 69,832 queries/sec
```

*Note: `sift find` performs substring search. As of v0.2, regex search is available through the `sift regex` command and is accelerated by trigram indexes (see Development Status).*

## Storage Guarantees

- **Crash safety**: CRC checks on all frames, atomic manifest updates
- **Consistency**: Single-writer/multi-reader with epoch-based snapshots
- **Durability**: All data flushed to disk before commit
- **Recovery**: Automatic tail repair and index rebuilding
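
As an illustration of the CRC checks behind these guarantees, here is a minimal sketch of computing and verifying a CRC-32 over frame content. It assumes the common IEEE 802.3 polynomial; this document does not state which CRC-32 variant SiftDB uses, and `content_is_intact` is a hypothetical helper name.

```rust
/// Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320).
/// Assumption: SiftDB's actual CRC variant may differ.
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // mask is all ones when the low bit is set, zero otherwise
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Hypothetical check: does the stored checksum still match the content?
fn content_is_intact(content: &[u8], stored_crc: u32) -> bool {
    crc32(content) == stored_crc
}

fn main() {
    let content = b"fn main() {}\n";
    let stored = crc32(content);
    println!("crc32 = {stored:#010x}");
    println!("intact: {}", content_is_intact(content, stored));
    println!("after corruption: {}", content_is_intact(b"fn main() {}\r", stored));
}
```

A reader that finds a mismatch at the tail of a segment can treat the frame as an incomplete write, which is the kind of case tail repair has to handle.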

## Development Status

**Current: Milestone 0.2 (Advanced Search) - COMPLETED**
- ✅ Segment writer with CRCs and frame format
- ✅ Path mapping and handle management  
- ✅ CLI: init, import, find, open, regex, compare
- ✅ Basic substring search with line citations
- ✅ **Path FST indexes** for fast path filtering
- ✅ **Trigram indexes** for regex acceleration
- ✅ **Full regex search** via `sift regex` command
- ✅ **Performance benchmarking** with regression detection
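
For readers unfamiliar with trigram acceleration, the sketch below shows the general idea: index every 3-byte window of each file, then intersect posting sets to narrow the files a regex has to scan. It is a generic in-memory illustration, not SiftDB's index format; `TrigramIndex` and its methods are hypothetical, and extracting required literals from a regex is done by hand here.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical in-memory trigram index: trigram -> ids of files containing it.
struct TrigramIndex {
    postings: HashMap<[u8; 3], HashSet<usize>>,
}

impl TrigramIndex {
    /// Index every 3-byte window of every file; a file's id is its position.
    fn build(files: &[&str]) -> Self {
        let mut postings: HashMap<[u8; 3], HashSet<usize>> = HashMap::new();
        for (id, content) in files.iter().enumerate() {
            for w in content.as_bytes().windows(3) {
                postings.entry([w[0], w[1], w[2]]).or_default().insert(id);
            }
        }
        Self { postings }
    }

    /// Sorted ids of files containing every trigram of `literal`.
    /// These are candidates only: the real match must still be run on them.
    /// Returns None when the literal is too short for the index to help.
    fn candidates(&self, literal: &str) -> Option<Vec<usize>> {
        let bytes = literal.as_bytes();
        if bytes.len() < 3 {
            return None;
        }
        let mut result: Option<HashSet<usize>> = None;
        for w in bytes.windows(3) {
            let ids = self
                .postings
                .get(&[w[0], w[1], w[2]])
                .cloned()
                .unwrap_or_default();
            result = Some(match result {
                None => ids,
                Some(acc) => acc.intersection(&ids).copied().collect(),
            });
        }
        let mut out: Vec<usize> = result.unwrap_or_default().into_iter().collect();
        out.sort_unstable();
        Some(out)
    }
}

fn main() {
    let files = [
        "pub struct User {}",
        "fn user_count() {}",
        "struct Config; // User settings",
    ];
    let index = TrigramIndex::build(&files);
    // A pattern like `struct.*User` requires both literals to be present.
    println!("struct -> {:?}", index.candidates("struct"));
    println!("User   -> {:?}", index.candidates("User"));
}
```

The index can return false positives (file 2 above contains both literals but in the wrong order for `struct.*User`), so it only prunes the search; correctness still comes from running the regex on each candidate.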

**Performance Improvements in v0.2:**
- 16.67% higher query rate (90K → 105K QPS)
- 17% higher throughput (21.2 → 24.8 MB/s)
- 13.6% lower response times

**Future Milestones**
- [ ] Incremental updates and tombstones
- [ ] SWMR locking and concurrency
- [ ] HTTP API server
- [ ] Python/other language bindings
- [ ] Compaction and bloom filters

## Contributing

SiftDB is designed to be the "SQLite of code search" - a reliable, embeddable database optimized for grep-like queries. 

Key design principles:
- **Agent-first**: APIs return structured citations, not just text
- **Append-only**: Immutable storage with indexes as caches
- **Citation-precise**: Every result includes file:line information
- **Crash-safe**: CRCs and atomic updates throughout

## License

MIT License - see LICENSE file for details.