sorex 0.2.5

Formally verified full-text search with suffix arrays
sorex-0.2.5 has been yanked.

Sorex

A full-text search engine with fuzzy and substring search, built to power search on harryzorus.xyz. Fast enough for real-time search, small enough for browsers, correct enough to prove it.

What This Is

Sorex builds compact binary search indices (.sorex format) that load instantly in WebAssembly. The algorithms are formally verified in Lean 4—not "we wrote some tests" verified, but mathematically proven correct.

┌──────────────────────────────────────────────────────────────────────────┐
│                                                                          │
│   ┌─────────┐     ┌─────────┐     ┌─────────┐     ┌─────────┐            │
│   │ Content │ ──► │  Index  │ ──► │  WASM   │ ──► │ Browser │            │
│   │  .json  │     │  .sorex │     │  153KB  │     │  <1ms   │            │
│   └─────────┘     └─────────┘     └─────────┘     └─────────┘            │
│                                                                          │
│   Your docs       Binary index    Loads anywhere  Instant search         │
│                                                                          │
└──────────────────────────────────────────────────────────────────────────┘

Why This Exists

The client-side search space is crowded—FlexSearch, Lunr.js, Fuse.js, MiniSearch, and others all solve this problem. Sorex makes a deliberate tradeoff: search capability and correctness over payload size and query syntax.

Most libraries use inverted indexes that can't find "script" inside "typescript". They're small and fast, but users expect substring matching. Sorex uses suffix arrays and Levenshtein automata to deliver true fuzzy and substring search—at the cost of a larger WASM bundle (153KB gzipped vs 6-25KB for pure JS alternatives).

The other bet: formal verification. Instead of hoping the ranking logic is correct, Sorex proves it in Lean 4.

  • Suffix array for O(log n) prefix search—find "auth" in "authentication" instantly
  • Levenshtein automata (Schulz-Mihov 2002) for typo-tolerant fuzzy matching without per-query DFA construction
  • Block PFOR compression for posting lists—smaller than JSON, faster to decode
  • Formal verification via Lean 4 proofs—binary search actually works, scoring hierarchy is actually correct

The result: a ~150KB WASM bundle that handles real-time search on thousands of documents.


Features

Search Capabilities

Feature How It Works Performance
Exact Match Inverted index with skip pointers O(1) term lookup
Prefix Search Binary search on vocabulary suffix array O(log k) where k = vocabulary size
Fuzzy Search Precomputed Levenshtein DFA traversal ~2ms for d=1, ~5ms for d=2
Field Ranking Title > Heading > Content, mathematically proven Compile-time verified
Deep Linking Section IDs in results for #anchor navigation Zero overhead

Binary Format (.sorex v7)

The index format is designed for fast memory-mapped loading. v7 is self-contained: a single .sorex file includes the search index, document metadata, and the WASM runtime.

  • Vocabulary: Length-prefixed UTF-8, lexicographically sorted
  • Suffix Array: Term index + offset pairs for prefix search
  • Inverted Index: Block PFOR-encoded posting lists with delta compression
  • Levenshtein DFA: Precomputed automata for d=1 and d=2 (zero runtime cost)
  • Section Table: IDs for deep linking into document sections
  • CRC32 Footer: Integrity verification

Total overhead: ~15% on top of raw text (vs 2-3x for JSON-based indices).

Formal Verification

Three layers of defense against bugs:

LEAN PROOFS          ──► Mathematical truth (5 proven theorems, 18 axioms)
TYPE-LEVEL WRAPPERS  ──► Compile-time invariants (ValidatedSuffixEntry, etc.)
RUNTIME CONTRACTS    ──► Debug-mode assertions (zero-cost in release)

The field scoring hierarchy is proven—Title always beats Heading, Heading always beats Content, regardless of position boosts. Binary search bounds are proven—results contain all matches and only matches.


Installation

Homebrew (macOS)

brew tap harryzorus/sorex
brew install sorex

Cargo

cargo install sorex

From Source

git clone https://github.com/harryzorus/sorex.git
cd sorex
cargo build --profile release-native

Debian/Ubuntu

cargo install cargo-deb
cargo deb --profile release-native --install

Usage

Build an Index

# Build index from a directory of JSON documents
sorex index --input ./docs --output ./search-output

# Inspect an existing index
sorex inspect ./search-output/index-*.sorex

# Build with demo HTML page
sorex index --input ./docs --output ./search-output --demo

Input Directory Format

Sorex reads a directory containing a manifest.json and per-document JSON files:

docs/
├── manifest.json          # Index configuration
├── getting-started.json   # Document files
├── installation.json
└── api-reference.json

manifest.json:

{
  "version": 1,
  "documents": ["getting-started.json", "installation.json", "api-reference.json"]
}

Per-document JSON:

{
  "id": 0,
  "slug": "getting-started",
  "title": "Getting Started",
  "excerpt": "Learn how to set up...",
  "href": "/docs/getting-started",
  "type": "doc",
  "category": "documentation",
  "text": "Getting Started Learn how to set up your first project...",
  "fieldBoundaries": [
    { "start": 0, "end": 15, "fieldType": "title", "sectionId": null },
    { "start": 16, "end": 100, "fieldType": "content", "sectionId": "intro" }
  ]
}

Browser Integration

The .sorex file is self-contained with embedded WASM. Use the generated sorex-loader.js:

import { loadSorex } from './sorex-loader.js';

// Load index (extracts and initializes WASM automatically)
const searcher = await loadSorex('./index-a1b2c3d4.sorex');

// Search
const results = searcher.search('query', 10);

// Results include sectionId for deep linking
results.forEach(r => {
  const url = r.sectionId ? `${r.href}#${r.sectionId}` : r.href;
  console.log(`${r.title}: ${url}`);
});

// Free when done (optional - uses FinalizationRegistry)
searcher.free();

The loader is fully self-contained with no external dependencies. Multiple .sorex files with different WASM versions can coexist on the same page.


Performance

Benchmarks on typical blog content (~50 docs, ~100KB text):

Operation Time Notes
Index build ~10ms Includes suffix array, inverted index, DFA
Index load (WASM) ~5ms Memory-mapped, minimal parsing
Exact search <0.1ms Inverted index lookup
Prefix search <0.5ms Binary search on vocabulary
Fuzzy search (d=1) ~2ms DFA traversal, no construction
Fuzzy search (d=2) ~5ms Larger automaton

Comparison with alternatives (100 documents, 500KB text):

Library Index Size Query Time Bundle Size
Sorex 85KB <1ms 153KB
Lunr.js 240KB 15ms 8KB
FlexSearch 180KB 3ms 22KB
Fuse.js N/A 30ms 24KB

Architecture

The codebase is organized around what things do, not what they are:

sorex/
├── src/
│   ├── types.rs          # Core data structures
│   ├── index.rs          # Suffix array construction
│   ├── search.rs         # Binary search implementation
│   ├── scoring.rs        # Field ranking (verified in Lean)
│   ├── levenshtein.rs    # Edit distance computation
│   ├── levenshtein_dfa.rs # Parametric automata (Schulz-Mihov)
│   ├── inverted.rs       # Inverted index + posting lists
│   ├── binary.rs         # .sorex format encoding/decoding
│   ├── wasm.rs           # WebAssembly bindings
│   ├── verified.rs       # Type-level invariant wrappers
│   └── contracts.rs      # Runtime debug assertions
│
├── lean/
│   └── SearchVerified/   # Lean 4 specifications and proofs
│       ├── Types.lean
│       ├── SuffixArray.lean
│       ├── BinarySearch.lean
│       ├── Scoring.lean
│       └── Levenshtein.lean
│
├── tests/
│   ├── invariants.rs     # Lean theorem verification
│   ├── property.rs       # Property-based tests (proptest)
│   └── integration.rs    # End-to-end tests
│
└── docs/
    ├── architecture.md   # Binary format, algorithm details
    ├── algorithms.md     # Deep dive into suffix arrays, DFAs
    └── integration.md    # WASM integration guide

Verification

Before committing changes, verify everything works:

cargo xtask verify    # Full suite: tests + Lean proofs + constant alignment
cargo xtask check     # Quick check: tests + clippy (no Lean)
cargo xtask test      # Just tests
cargo xtask lean      # Just Lean proofs
cargo xtask bench     # Benchmarks

See Verification for the formal verification approach and CLAUDE.md for AI agent guidelines.


Documentation

Document What's in it
Architecture Binary format spec, algorithm overview
Algorithms Suffix arrays, Levenshtein automata, Block PFOR
Benchmarks Performance comparisons with FlexSearch, lunr.js, etc.
Integration WASM setup, browser integration, TypeScript types
Verification Formal verification guide, safe refactoring
Contributing How to contribute without breaking proofs

License

Apache-2.0. See LICENSE for details.


Built with Rust and verified with Lean 4.