distx-schema 0.2.7

# DistX

[![Crates.io](https://img.shields.io/crates/v/distx.svg)](https://crates.io/crates/distx)
[![Documentation](https://docs.rs/distx/badge.svg)](https://docs.rs/distx)
[![Docker](https://img.shields.io/docker/v/distx/distx?label=docker)](https://hub.docker.com/r/distx/distx)
[![License](https://img.shields.io/crates/l/distx.svg)](https://github.com/antonellof/DistX#license)

> **DistX does not store vectors that represent objects.  
> It stores objects, and derives vectors from their structure.**

A high-performance vector database with the **Similarity Contract** — a schema-driven approach to structured similarity search that is deterministic, explainable, and requires no external ML.

---

## The Similarity Contract

The schema is not just configuration — it's a **contract** that governs:

| Aspect | What the Schema Controls |
|--------|--------------------------|
| **Ingest** | How objects are converted to vectors (deterministic, reproducible) |
| **Query** | How similarity is computed across multiple field types |
| **Ranking** | How results are scored with structured distance functions |
| **Explainability** | How each field contributes to the final score |

**This is an architectural difference**, not just an API feature. You cannot replicate this with Qdrant hybrid queries without replicating half this codebase client-side.

---

## What DistX Is (and Is Not)

| DistX IS | DistX is NOT |
|----------|--------------|
| A contract-based similarity engine | A neural embedding model |
| Deterministic and reproducible | A probabilistic LLM system |
| Designed for structured/tabular data | A black-box recommender |
| Fully explainable (per-field scores) | Dependent on external ML APIs |

**Target domains**: ERP, e-commerce, CRM, financial data, any tabular dataset.

---

## How It Works

```
┌──────────────────────────────────────────────────────────────────────────┐
│  Traditional Vector Database                                             │
│  ───────────────────────────                                             │
│  Your Data → External ML API → Embeddings → Vector DB → Score: 0.87     │
│              (cost per call)   (black box)              (unexplained)    │
│              (model drift)     (retraining)             (no breakdown)   │
├──────────────────────────────────────────────────────────────────────────┤
│  DistX with Similarity Contract                                          │
│  ──────────────────────────────                                          │
│  Your Data → Schema (JSON) → Deterministic → Explainable Results        │
│              (contract)       (no drift)     (name: 0.25, price: 0.22)   │
│              (stable)         (reproducible) (auditable)                 │
└──────────────────────────────────────────────────────────────────────────┘
```

📖 [Detailed comparison with Qdrant, Pinecone, Elasticsearch →](documentation/COMPARISON.md)

---

## Similarity Contract Engine

**The first schema-driven structured similarity engine with built-in explainability.**

Define a Similarity Contract, insert your data, and query by example — vectors are derived automatically from object structure. No external ML, no embedding pipelines, no black-box scores.

```bash
# 1. Define similarity schema
curl -X PUT http://localhost:6333/collections/products -H "Content-Type: application/json" -d '{
  "similarity_schema": {
    "fields": {
      "name": {"type": "text", "weight": 0.4},
      "price": {"type": "number", "distance": "relative", "weight": 0.3},
      "category": {"type": "categorical", "weight": 0.2},
      "brand": {"type": "categorical", "weight": 0.1}
    }
  }
}'

# 2. Insert data (vectors auto-generated)
curl -X PUT http://localhost:6333/collections/products/points -H "Content-Type: application/json" -d '{
  "points": [
    {"id": 1, "payload": {"name": "Prosciutto di Parma DOP", "price": 8.99, "category": "salumi", "brand": "Parma"}},
    {"id": 2, "payload": {"name": "Prosciutto cotto", "price": 4.99, "category": "salumi", "brand": "Negroni"}},
    {"id": 3, "payload": {"name": "iPhone 15 Pro", "price": 1199, "category": "electronics", "brand": "Apple"}}
  ]
}'

# 3. Query by example
curl -X POST http://localhost:6333/collections/products/similar -H "Content-Type: application/json" \
  -d '{"example": {"name": "prosciutto crudo", "price": 8.0, "category": "salumi"}, "limit": 3}'
```

**Response includes per-field explainability:**

```
┌──────┬─────────────────────────────┬───────┬──────────────────────────────────┐
│ Rank │ Product                     │ Score │ Contribution Breakdown           │
├──────┼─────────────────────────────┼───────┼──────────────────────────────────┤
│  1   │ Prosciutto di Parma DOP     │ 0.71  │ name: 0.22, price: 0.22          │
│  2   │ Prosciutto cotto            │ 0.68  │ name: 0.25, category: 0.20       │
│  3   │ Coppa di Parma              │ 0.53  │ category: 0.20, price: 0.25      │
└──────┴─────────────────────────────┴───────┴──────────────────────────────────┘
```

### Key Capabilities

| Capability | Description |
|------------|-------------|
| **Schema-Driven** | Declarative field definitions with typed similarity (text, number, categorical, boolean) |
| **Auto-Embedding** | Deterministic vector generation from structured payloads |
| **Query by Example** | Natural JSON queries instead of raw vectors |
| **Explainable Scoring** | Per-field contribution breakdown for every result |
| **Dynamic Weights** | Override field importance at query time without re-indexing |
| **Zero External Dependencies** | Fully self-contained, works offline and air-gapped |

### What You Can Do That Qdrant Cannot

#### Example 1: Change Similarity Semantics Without Re-embedding

```bash
# Same data, different meaning of "similar" — no re-indexing required

# Query 1: "Find similar products" (balanced)
curl -X POST /collections/products/similar \
  -d '{"example": {"name": "iPhone 15"}, "limit": 5}'

# Query 2: "Find cheaper alternatives" (boost price)
curl -X POST /collections/products/similar \
  -d '{"example": {"name": "iPhone 15"}, "weights": {"price": 0.7, "name": 0.2}, "limit": 5}'

# Query 3: "Find same brand, any price" (boost brand)
curl -X POST /collections/products/similar \
  -d '{"example": {"name": "iPhone 15"}, "weights": {"brand": 0.6, "category": 0.3}, "limit": 5}'
```

In Qdrant: You would need to re-embed everything or build complex client-side logic.

#### Example 2: Same Schema, Different Datasets

```bash
# One Similarity Contract works across domains:

# Products
{"name": "iPhone 15", "price": 999, "category": "electronics", "brand": "Apple"}

# Suppliers  
{"name": "Acme Corp", "price": 5000000, "category": "manufacturing", "brand": "certified"}

# Financial Assets
{"name": "AAPL Stock", "price": 178.50, "category": "equity", "brand": "tech"}

# Same schema, same queries, same explainability — across all datasets
```

This is **not product-specific**. The Similarity Contract is domain-agnostic.

📖 [Documentation](documentation/SIMILARITY_ENGINE.md) · [Interactive Demo](documentation/SIMILARITY_DEMO.md) · [Comparison with Alternatives](documentation/COMPARISON.md)

---

## 100% Qdrant API Compatible

DistX maintains **full compatibility with the Qdrant API**, so you can:

- ✅ Use existing Qdrant client libraries (Python, JavaScript, Rust, Go)
- ✅ Drop-in replace Qdrant in your stack
- ✅ Use Qdrant's Web Dashboard UI
- ✅ Migrate with zero code changes

The Similarity Engine is **additive** — all standard vector operations work exactly like Qdrant:

```bash
# Standard Qdrant-compatible vector search still works!
curl -X POST http://localhost:6333/collections/my_collection/points/search \
  -d '{"vector": [0.1, 0.2, 0.3, ...], "limit": 10}'
```

---

## Quick Start

### 1. Start DistX with Docker

```bash
# Pull and run (with persistent storage)
docker run -d --name distx \
  -p 6333:6333 -p 6334:6334 \
  -v distx_data:/qdrant/storage \
  distx/distx:latest

# Or with docker-compose
docker-compose up -d
```

DistX is now running at:
- **REST API**: http://localhost:6333
- **Web Dashboard**: http://localhost:6333/dashboard
- **gRPC**: localhost:6334

### 2. Create a Collection with Similarity Schema

```bash
curl -X PUT http://localhost:6333/collections/products \
  -H "Content-Type: application/json" \
  -d '{
    "similarity_schema": {
      "fields": {
        "name": {"type": "text", "weight": 0.4},
        "price": {"type": "number", "distance": "relative", "weight": 0.3},
        "category": {"type": "categorical", "weight": 0.2},
        "in_stock": {"type": "boolean", "weight": 0.1}
      }
    }
  }'
```

### 3. Insert Data (No Vectors Needed!)

```bash
curl -X PUT http://localhost:6333/collections/products/points \
  -H "Content-Type: application/json" \
  -d '{
    "points": [
      {"id": 1, "payload": {"name": "Prosciutto di Parma DOP", "price": 8.99, "category": "salumi", "in_stock": true}},
      {"id": 2, "payload": {"name": "Prosciutto cotto", "price": 4.99, "category": "salumi", "in_stock": true}},
      {"id": 3, "payload": {"name": "Mortadella Bologna", "price": 3.99, "category": "salumi", "in_stock": false}},
      {"id": 4, "payload": {"name": "Parmigiano Reggiano", "price": 18.99, "category": "cheese", "in_stock": true}},
      {"id": 5, "payload": {"name": "Grana Padano", "price": 14.99, "category": "cheese", "in_stock": true}}
    ]
  }'
```

### 4. Query by Example

```bash
# Find products similar to "prosciutto crudo around $8"
curl -X POST http://localhost:6333/collections/products/similar \
  -H "Content-Type: application/json" \
  -d '{
    "example": {
      "name": "prosciutto crudo",
      "price": 8.0,
      "category": "salumi"
    },
    "limit": 3
  }'
```

**Response with explainable scores:**

```json
{
  "result": [
    {
      "id": 1,
      "score": 0.71,
      "payload": {"name": "Prosciutto di Parma DOP", "price": 8.99, "category": "salumi"},
      "explain": {"name": 0.22, "price": 0.24, "category": 0.20, "in_stock": 0.05}
    },
    {
      "id": 2,
      "score": 0.65,
      "payload": {"name": "Prosciutto cotto", "price": 4.99, "category": "salumi"},
      "explain": {"name": 0.25, "price": 0.15, "category": 0.20, "in_stock": 0.05}
    }
  ]
}
```

### 5. Dynamic Weight Overrides

```bash
# Find cheaper alternatives (boost price importance)
curl -X POST http://localhost:6333/collections/products/similar \
  -H "Content-Type: application/json" \
  -d '{
    "example": {"name": "prosciutto", "category": "salumi"},
    "weights": {"price": 0.6, "name": 0.2, "category": 0.2},
    "limit": 3
  }'
```

### 6. Query by Existing Point ID

```bash
# Find products similar to ID 4 (Parmigiano Reggiano)
curl -X POST http://localhost:6333/collections/products/similar \
  -H "Content-Type: application/json" \
  -d '{"like_id": 4, "limit": 3}'
```

### 7. Run the Interactive Demo

```bash
# Full demo with sample data
python scripts/similarity_demo.py

# Or run specific demos
python scripts/similarity_demo.py --demo products
python scripts/similarity_demo.py --demo suppliers
```

### Alternative: Traditional Vector Search

DistX also supports standard Qdrant-compatible vector operations:

```bash
# Create collection with vectors
curl -X PUT http://localhost:6333/collections/embeddings \
  -H "Content-Type: application/json" \
  -d '{"vectors": {"size": 128, "distance": "Cosine"}}'

# Insert with vectors
curl -X PUT http://localhost:6333/collections/embeddings/points \
  -H "Content-Type: application/json" \
  -d '{
    "points": [
      {"id": 1, "vector": [0.1, 0.2, ...], "payload": {"text": "example"}}
    ]
  }'

# Vector search
curl -X POST http://localhost:6333/collections/embeddings/points/search \
  -H "Content-Type: application/json" \
  -d '{"vector": [0.1, 0.2, ...], "limit": 10}'
```

### Installation Alternatives

```bash
# From crates.io
cargo install distx
distx --data-dir ./data

# From source
git clone https://github.com/antonellof/distx
cd distx && cargo build --release
./target/release/distx
```

---

## Performance

| Metric | Performance |
|--------|-------------|
| **Vector Insert** | ~8,000 ops/sec |
| **Vector Search** | ~400-500 ops/sec |
| **Search Latency (p50)** | ~2ms |
| **Search Latency (p99)** | ~5ms |
| **Similarity Query** | <1ms overhead |

*Benchmarks: 5,000 vectors, 128 dimensions, Cosine distance*

---

## All Features

### Similarity Engine (NEW)
- **Schema-driven similarity** — Define what fields matter
- **Auto-embedding** — Vectors generated from payload
- **Multi-type support** — Text, number, categorical, boolean
- **Explainable results** — Per-field score breakdown
- **Dynamic weights** — Override at query time

### Vector Database
- **HNSW Index** — Fast ANN with SIMD (AVX2, SSE, NEON)
- **BM25 Text Search** — Full-text ranking
- **Payload Filtering** — JSON metadata queries
- **Dual API** — REST + gRPC
- **Persistence** — WAL, snapshots, LMDB

### Operations
- **Single Binary** — ~6MB, no dependencies
- **Docker Ready** — Single command deployment
- **Web Dashboard** — Qdrant-compatible UI

---

## Documentation

| Guide | Description |
|-------|-------------|
| [**Similarity Engine**](documentation/SIMILARITY_ENGINE.md) | Schema-driven similarity for tabular data |
| [**Similarity Demo**](documentation/SIMILARITY_DEMO.md) | Interactive walkthrough with examples |
| [**Comparison**](documentation/COMPARISON.md) | DistX vs Qdrant, Pinecone, Elasticsearch |
| [Quick Start](documentation/QUICK_START.md) | Get started in 5 minutes |
| [Docker Guide](documentation/DOCKER.md) | Container deployment |
| [API Reference](documentation/API.md) | REST and gRPC endpoints |
| [Architecture](documentation/ARCHITECTURE.md) | System design |

---

## Use Cases

### 🛒 E-Commerce & Retail

**Problem:** "Show me products similar to this one" — but similarity means different things (style, price, brand).

```bash
# Similar products for "customers also viewed"
curl -X POST /collections/products/similar -d '{
  "example": {"name": "Nike Air Max 90", "price": 129, "category": "sneakers", "brand": "Nike"},
  "limit": 6
}'

# Budget alternatives (boost price importance)
curl -X POST /collections/products/similar -d '{
  "like_id": 123,
  "weights": {"price": 0.6, "category": 0.3, "brand": 0.1}
}'
```

**Use cases:**
- "Similar products" on product pages
- "You might also like" recommendations  
- Competitor price matching (find similar products, compare prices)
- Inventory substitution (out of stock → suggest alternatives)

---

### 🏭 ERP & Supply Chain

**Problem:** Find the best supplier match based on multiple criteria without building ML pipelines.

```bash
# Find suppliers similar to your top performer
curl -X POST /collections/suppliers/similar -d '{
  "example": {
    "industry": "manufacturing",
    "annual_revenue": 5000000,
    "employee_count": 150,
    "certified": true,
    "location": "Milan"
  },
  "limit": 10
}'
```

**Use cases:**
- Supplier discovery and matching
- Vendor risk assessment (find similar vendors to flagged ones)
- Partner recommendations
- RFQ (Request for Quote) matching

---

### 👥 CRM & Customer Data

**Problem:** Find similar customers for segmentation, lead scoring, or churn prediction.

```bash
# Find customers similar to your best ones
curl -X POST /collections/customers/similar -d '{
  "example": {
    "industry": "fintech",
    "company_size": "enterprise",
    "annual_spend": 50000,
    "engagement_score": 85
  }
}'

# Find leads similar to closed-won deals
curl -X POST /collections/leads/similar -d '{
  "like_id": "deal_12345",
  "weights": {"deal_size": 0.4, "industry": 0.3, "company_size": 0.3}
}'
```

**Use cases:**
- Lead scoring (similar to converted leads?)
- Customer segmentation
- Churn prediction (similar to churned customers?)
- Account-based marketing (find lookalike companies)

---

### 🔍 Data Quality & Deduplication

**Problem:** Find duplicate or near-duplicate records without exact matching.

```bash
# Find potential duplicates
curl -X POST /collections/contacts/similar -d '{
  "example": {"name": "John Smith", "email": "j.smith@acme.com", "company": "Acme Inc"},
  "limit": 5
}'

# Response shows WHY records might be duplicates
# → name: 0.35 (similar names)
# → company: 0.25 (same company) 
# → email: 0.15 (different email domain)
```

**Use cases:**
- Contact/account deduplication
- Data cleansing before migration
- Master data management
- Merge candidate identification

---

### 📊 Data Analysis & Exploration

**Problem:** Explore datasets by finding similar records without writing complex SQL.

```bash
# "Find transactions similar to this suspicious one"
curl -X POST /collections/transactions/similar -d '{
  "example": {"amount": 9999, "merchant_category": "travel", "country": "unusual"},
  "weights": {"amount": 0.5, "merchant_category": 0.3}
}'

# "Find properties similar to this sold one"
curl -X POST /collections/properties/similar -d '{
  "example": {"sqft": 2500, "bedrooms": 4, "neighborhood": "downtown", "year_built": 2010}
}'
```

**Use cases:**
- Fraud pattern detection
- Anomaly investigation
- Comparable analysis (real estate, finance)
- Research dataset exploration

---

### ⚖️ Regulated Industries (Finance, Healthcare, Legal)

**Problem:** Need similarity search with full auditability — can't use black-box ML.

**Why DistX:**
- **Explainable scores** — Per-field contribution breakdown
- **Deterministic** — Same query always returns same explanation
- **Auditable** — Schema defines what matters, weights are transparent
- **No external APIs** — Data never leaves your infrastructure

```bash
# Healthcare: Find similar patient cases
curl -X POST /collections/patients/similar -d '{
  "example": {"diagnosis_code": "E11.9", "age_group": "65+", "comorbidities": 3}
}'

# Response includes full explanation for audit trail:
# {
#   "score": 0.78,
#   "explain": {
#     "diagnosis_code": 0.35,  ← Same diagnosis
#     "age_group": 0.25,       ← Same age bracket
#     "comorbidities": 0.18   ← Similar complexity
#   }
# }
```

**Use cases:**
- Clinical trial patient matching
- Insurance claim similarity
- Legal case precedent search
- Compliance reporting

---

### 🏠 Real Estate & Property

```bash
# Find comparable properties for valuation
curl -X POST /collections/properties/similar -d '{
  "example": {
    "sqft": 2200,
    "bedrooms": 3,
    "bathrooms": 2,
    "year_built": 2015,
    "neighborhood": "downtown",
    "property_type": "condo"
  },
  "weights": {"sqft": 0.3, "neighborhood": 0.25, "property_type": 0.2}
}'
```

**Use cases:**
- Comparable property analysis (comps)
- Property valuation
- Investment opportunity matching
- Tenant-property matching

---

### 🎯 HR & Recruiting

```bash
# Find candidates similar to your top performers
curl -X POST /collections/employees/similar -d '{
  "example": {
    "department": "engineering",
    "years_experience": 5,
    "skills": "rust,python",
    "performance_rating": "exceeds"
  }
}'
```

**Use cases:**
- Candidate matching to job requirements
- Internal mobility (find similar roles)
- Team composition analysis
- Succession planning

---

## Use as a Library

```toml
[dependencies]
distx = "0.2.5"
distx-schema = "0.2.5"  # Similarity Engine
distx-core = "0.2.5"        # Core data structures
```

```rust
use distx_similarity::{SimilaritySchema, FieldConfig, StructuredEmbedder, Reranker};
use std::collections::HashMap;

// Define schema
let mut fields = HashMap::new();
fields.insert("name".to_string(), FieldConfig::text(0.5));
fields.insert("price".to_string(), FieldConfig::number(0.3, DistanceType::Relative));
fields.insert("category".to_string(), FieldConfig::categorical(0.2));

let schema = SimilaritySchema::new(fields);
let embedder = StructuredEmbedder::new(schema.clone());

// Auto-generate vector from payload
let payload = json!({"name": "Prosciutto", "price": 8.99, "category": "salumi"});
let vector = embedder.embed(&payload);
```

---

## Links

- **Crates.io**: https://crates.io/crates/distx
- **Documentation**: https://docs.rs/distx
- **GitHub**: https://github.com/antonellof/distx

## License

Licensed under MIT OR Apache-2.0 at your option.