# Public Dataset Evaluation Methodology

This benchmark extends the paper's evaluation beyond synthetic workloads to a public retrieval dataset with official relevance judgments.

## Dataset

- Dataset: `BEIR/SciFact`
- Source: the public BEIR dataset release
- Corpus fields used: title + abstract text
- Query relevance: official SciFact qrels

## Embeddings

To keep the benchmark fully local and reproducible, it uses a deterministic hashed embedding function instead of a hosted or heavyweight neural embedding model.

Properties:

- token hashing into a fixed-dimensional dense vector
- L2 normalization
- same embedding function for corpus and query text
- no external API dependency

This is not intended to represent state-of-the-art semantic embedding quality. It is intended to create a reproducible shared dense representation so systems can be compared on the same public corpus.
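
As a rough illustration of the properties listed above, a hashed embedding could look something like the sketch below. This is not the benchmark's actual code: the function name `hashed_embedding`, the dimensionality of 256, the tokenizer regex, the use of SHA-256, and the signed-bucket trick are all assumptions made for the example.

```python
import hashlib
import math
import re

DIM = 256  # assumed dimensionality; the real value is not stated in this document


def hashed_embedding(text: str, dim: int = DIM) -> list[float]:
    """Hash each token into one of `dim` buckets, then L2-normalize the result."""
    vec = [0.0] * dim
    # Assumed tokenizer: lowercase alphanumeric runs.
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # A fixed digest is stable across runs and machines, unlike Python's
        # built-in hash() for strings, which is randomized per process.
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        # Assumed signed hashing: one digest bit picks +1 or -1.
        sign = 1.0 if digest[4] & 1 else -1.0
        vec[bucket] += sign
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return vec  # empty or fully cancelled input stays the zero vector
    return [x / norm for x in vec]
```

The same function would be applied to both corpus text (title + abstract) and query text, so that every compared system indexes and searches identical vectors.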

## Benchmarks

### 1. Vector Exact Benchmark

Compared systems:

- `SQLRite brute_force`
- `sqlite-vec exact`
- `pgvector exact`
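
For orientation, "exact" here means every corpus vector is scored against the query, with no approximate index in between. A minimal sketch of that idea, assuming unit-length vectors (so the dot product equals cosine similarity) and a hypothetical `exact_top_k` helper that is not part of any of the compared systems:

```python
import heapq


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def exact_top_k(query, corpus, k):
    """Score every document and return the k best as (doc_id, score), best first.

    `corpus` maps doc_id -> vector. This layout is assumed for illustration.
    """
    scored = ((dot(query, vec), doc_id) for doc_id, vec in corpus.items())
    top = heapq.nlargest(k, scored)
    return [(doc_id, score) for score, doc_id in top]
```

Because all three systems perform a full scan in this mode, they should return the same ranking for the same vectors, apart from tie-breaking and floating-point differences. Latency and throughput are then the main differentiators.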

Metrics:

- QPS
- p50 latency
- p95 latency
- recall@k
- MRR@k
- NDCG@k
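
The quality and latency metrics above can be sketched as follows. This uses the common textbook definitions and assumes binary relevance (a document is either in the qrels for a query or it is not) and nearest-rank percentiles. The benchmark's own implementation may differ in these details, and the function names are illustrative only.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def mrr_at_k(ranked, relevant, k):
    """Reciprocal rank of the first relevant document in the top k, else 0."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """Binary-gain NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def percentile(samples, p):
    """Nearest-rank percentile, e.g. p=50 for p50 latency, p=95 for p95."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]
```

Per-query values of recall@k, MRR@k, and NDCG@k would typically be averaged over all queries. QPS is the number of queries divided by total wall-clock time.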

### 2. Hybrid Lexical + Vector Benchmark

Compared systems:

- `SQLRite hybrid`
- `pgvector hybrid`

Reason for this narrower set:

- both systems support a meaningful lexical + vector query path on the same data
- `sqlite-vec` is vector-only, so hybrid parity is not available there
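
This document does not say how either system combines lexical and vector results. Purely to illustrate what a hybrid query path involves, the sketch below merges two ranked lists with reciprocal rank fusion (RRF), one widely used technique. It is an assumption that anything like this is used here. The function name and the constant `rrf_k = 60` are illustrative and not taken from SQLRite or pgvector.

```python
def reciprocal_rank_fusion(lexical_ranked, vector_ranked, rrf_k=60):
    """Merge two best-first doc_id lists into one fused ranking.

    Each list contributes 1 / (rrf_k + rank) per document. Documents found
    by both retrievers accumulate both contributions.
    """
    scores = {}
    for ranking in (lexical_ranked, vector_ranked):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    # Sort by fused score, highest first; break ties by doc_id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Whatever fusion method is actually used, a hybrid comparison is only meaningful when both systems run the lexical and the vector stage over the same data, which is why the set of compared systems is narrower here.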

## Interpretation

This benchmark is stronger than a synthetic-only benchmark because it uses public queries and qrels. Its results are still limited by two factors: the deterministic local embedding function, which is not a state-of-the-art semantic model, and the single-host setup.