Sorex
A full-text search engine with fuzzy and substring search, built to power search on harryzorus.xyz. Fast enough for real-time search, small enough for browsers, correct enough to prove it.
What This Is
Sorex builds compact binary search indices (.sorex format) that load instantly in WebAssembly. The algorithms are formally verified in Lean 4—not "we wrote some tests" verified, but mathematically proven correct.
┌──────────────────────────────────────────────────────────────────────────┐
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Content │ ──► │ Index │ ──► │ WASM │ ──► │ Browser │ │
│ │ .json │ │ .sorex │ │ 153KB │ │ <1ms │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ Your docs Binary index Loads anywhere Instant search │
│ │
└──────────────────────────────────────────────────────────────────────────┘
Why This Exists
The client-side search space is crowded—FlexSearch, Lunr.js, Fuse.js, MiniSearch, and others all solve this problem. Sorex makes a deliberate tradeoff: search capability and correctness over payload size and query syntax.
Most libraries use inverted indexes that can't find "script" inside "typescript". They're small and fast, but users expect substring matching. Sorex uses suffix arrays and Levenshtein automata to deliver true fuzzy and substring search—at the cost of a larger WASM bundle (153KB gzipped vs 6-25KB for pure JS alternatives).
The other bet: formal verification. Instead of hoping the ranking logic is correct, Sorex proves it in Lean 4.
- Suffix array for O(log n) prefix search—find "auth" in "authentication" instantly
- Levenshtein automata (Schulz-Mihov 2002) for typo-tolerant fuzzy matching without per-query DFA construction
- Block PFOR compression for posting lists—smaller than JSON, faster to decode
- Formal verification via Lean 4 proofs—binary search actually works, scoring hierarchy is actually correct
The result: a ~150KB WASM bundle that handles real-time search on thousands of documents.
Features
Search Capabilities
| Feature | How It Works | Performance |
|---|---|---|
| Exact Match | Inverted index with skip pointers | O(1) term lookup |
| Prefix Search | Binary search on vocabulary suffix array | O(log k) where k = vocabulary size |
| Fuzzy Search | Precomputed Levenshtein DFA traversal | ~2ms for d=1, ~5ms for d=2 |
| Field Ranking | Title > Heading > Content, mathematically proven | Compile-time verified |
| Deep Linking | Section IDs in results for #anchor navigation | Zero overhead |
Binary Format (.sorex v7)
The index format is designed for fast memory-mapped loading. v7 is self-contained: a single .sorex file includes the search index, document metadata, and the WASM runtime.
- Vocabulary: Length-prefixed UTF-8, lexicographically sorted
- Suffix Array: Term index + offset pairs for prefix search
- Inverted Index: Block PFOR-encoded posting lists with delta compression
- Levenshtein DFA: Precomputed automata for d=1 and d=2 (zero runtime cost)
- Section Table: IDs for deep linking into document sections
- CRC32 Footer: Integrity verification
Total overhead: ~15% on top of raw text (vs 2-3x for JSON-based indices).
Formal Verification
Three layers of defense against bugs:
LEAN PROOFS ──► Mathematical truth (5 proven theorems, 18 axioms)
TYPE-LEVEL WRAPPERS ──► Compile-time invariants (ValidatedSuffixEntry, etc.)
RUNTIME CONTRACTS ──► Debug-mode assertions (zero-cost in release)
The field scoring hierarchy is proven—Title always beats Heading, Heading always beats Content, regardless of position boosts. Binary search bounds are proven—results contain all matches and only matches.
Installation
Homebrew (macOS)
Cargo
From Source
Debian/Ubuntu
Usage
Build an Index
# Build index from a directory of JSON documents
# Inspect an existing index
# Build with demo HTML page
Input Directory Format
Sorex reads a directory containing a manifest.json and per-document JSON files:
docs/
├── manifest.json # Index configuration
├── getting-started.json # Document files
├── installation.json
└── api-reference.json
manifest.json:
Per-document JSON:
Browser Integration
The .sorex file is self-contained with embedded WASM. Use the generated sorex-loader.js:
import from './sorex-loader.js';
// Load index (extracts and initializes WASM automatically)
const searcher = await ;
// Search
const results = searcher.;
// Results include sectionId for deep linking
results.;
// Free when done (optional - uses FinalizationRegistry)
searcher.;
The loader is fully self-contained with no external dependencies. Multiple .sorex files with different WASM versions can coexist on the same page.
Performance
Benchmarks on typical blog content (~50 docs, ~100KB text):
| Operation | Time | Notes |
|---|---|---|
| Index build | ~10ms | Includes suffix array, inverted index, DFA |
| Index load (WASM) | ~5ms | Memory-mapped, minimal parsing |
| Exact search | <0.1ms | Inverted index lookup |
| Prefix search | <0.5ms | Binary search on vocabulary |
| Fuzzy search (d=1) | ~2ms | DFA traversal, no construction |
| Fuzzy search (d=2) | ~5ms | Larger automaton |
Comparison with alternatives (100 documents, 500KB text):
| Library | Index Size | Query Time | Bundle Size |
|---|---|---|---|
| Sorex | 85KB | <1ms | 153KB |
| Lunr.js | 240KB | 15ms | 8KB |
| FlexSearch | 180KB | 3ms | 22KB |
| Fuse.js | N/A | 30ms | 24KB |
Architecture
The codebase is organized around what things do, not what they are:
sorex/
├── src/
│ ├── types.rs # Core data structures
│ ├── index.rs # Suffix array construction
│ ├── search.rs # Binary search implementation
│ ├── scoring.rs # Field ranking (verified in Lean)
│ ├── levenshtein.rs # Edit distance computation
│ ├── levenshtein_dfa.rs # Parametric automata (Schulz-Mihov)
│ ├── inverted.rs # Inverted index + posting lists
│ ├── binary.rs # .sorex format encoding/decoding
│ ├── wasm.rs # WebAssembly bindings
│ ├── verified.rs # Type-level invariant wrappers
│ └── contracts.rs # Runtime debug assertions
│
├── lean/
│ └── SearchVerified/ # Lean 4 specifications and proofs
│ ├── Types.lean
│ ├── SuffixArray.lean
│ ├── BinarySearch.lean
│ ├── Scoring.lean
│ └── Levenshtein.lean
│
├── tests/
│ ├── invariants.rs # Lean theorem verification
│ ├── property.rs # Property-based tests (proptest)
│ └── integration.rs # End-to-end tests
│
└── docs/
├── architecture.md # Binary format, algorithm details
├── algorithms.md # Deep dive into suffix arrays, DFAs
└── integration.md # WASM integration guide
Verification
Before committing changes, verify everything works:
See Verification for the formal verification approach and CLAUDE.md for AI agent guidelines.
Documentation
| Document | What's in it |
|---|---|
| Architecture | Binary format spec, algorithm overview |
| Algorithms | Suffix arrays, Levenshtein automata, Block PFOR |
| Benchmarks | Performance comparisons with FlexSearch, lunr.js, etc. |
| Integration | WASM setup, browser integration, TypeScript types |
| Verification | Formal verification guide, safe refactoring |
| Contributing | How to contribute without breaking proofs |
License
Apache-2.0. See LICENSE for details.
Built with Rust and verified with Lean 4.