# Hyperspatial Ecosystem Redesign
**Date:** 2024-10-29
**Status:** Phase 2 Implementation In Progress
**Current Phase:** Phase 2A Complete (Week 1 - Storage Scaffolding)
**Branch:** storage-rewrite
**Goal:** Reorganize and rebuild the Hyperspatial ecosystem with clear component boundaries, simplified architecture, and Manifold as the storage foundation.
---
## Phase 2 Status Update (2025-01-XX)
**Week 1 Complete**: ✅ Storage module scaffolding done
- Old `persistence/` moved to `_old_persistence/` (reference only)
- New `storage/` module created with Manifold integration
- 12 files (1,489 lines) with comprehensive documentation
- Storage-specific errors separated from domain types
- StorageConfig with 5 presets (Default, Small, Large, ReadHeavy, WriteHeavy)
- All interfaces defined, ready for Week 2 implementation
**Next**: Week 2 - Entity & Property CRUD implementation
See **Appendix A: Module Structure** for actual implemented file layout.
---
---
## Executive Summary
This document outlines a comprehensive redesign of the Hyperspatial ecosystem to create a modular, composable platform with clear component boundaries. The redesign eliminates complexity, consolidates duplicate functionality, and establishes Manifold as the storage foundation while keeping Hyperspatial focused on its core value: hyperbolic space operations.
### Key Principles
1. **Separation of Concerns**: Each component has a single, well-defined responsibility
2. **Plug-and-Play Architecture**: Components are independently useful and optionally composable
3. **Simplify, Don't Complicate**: Use Manifold's clean primitives instead of custom implementations
4. **Bottom-Up Rebuild**: Start fresh on a new branch, rewriting cleanly using the right abstractions
5. **Greenfield Approach**: No backwards compatibility constraints, design the optimal architecture
---
## Component Ownership Model
### Tessera (Embeddings Library)
**Status:** Stable, no changes needed
**Responsibility:** Multi-paradigm embedding generation
- Dense single-vector embeddings (BGE, Nomic, GTE, Qwen, Jina)
- Multi-vector token embeddings (ColBERT)
- Sparse learned representations (SPLADE)
- Vision-language embeddings (ColPali)
- Time series forecasting (Chronos Bolt)
- GPU acceleration (Metal, CUDA)
- Standalone library with Rust + Python bindings
**Dependencies:** None
**Used by:** Hyperspatial (optional), Haematite (optional), user applications
---
### Manifold (Storage Primitives)
**Status:** Stable, no changes needed
**Responsibility:** Fast, reliable, ACID-compliant embedded storage
#### Core Features
- Column families for concurrent writes
- Write-ahead log (WAL) for fast commits (~0.5ms)
- MVCC transactions with snapshot isolation
- WASM support via OPFS
#### Domain-Specific Crates
- **manifold-vectors**: Dense, sparse, multi-vector storage with zero-copy access
- **manifold-timeseries**: Multi-granularity time series with delta encoding
- **manifold-graph**: Bidirectional graph storage with UUID vertices and typed edges
**Key Design Points:**
- Each column family is a separate file descriptor within a single database file
- This reduces file handles from (collections × shards × storage_types × separate_databases) to (collections × shards × 1 database with N column families)
- **Fixed-width UUID keys**: All entity IDs use 16-byte UUID keys for zero-copy deserialization
- Composite keys use UUID components where applicable (e.g., graph edges use UUID vertices)
**Dependencies:** None
**Used by:** Hyperspatial, compiled queries, user applications
---
### HyperQL (Query Language)
**Status:** Needs refactoring for general-purpose use
**Responsibility:** Multi-paradigm query language with compilation capabilities
#### Core Components
- **Parser**: Winnow-based parser producing AST
- **AST**: Abstract syntax tree for all query constructs
- **Compiler**: Transforms AST to execution plans
- **Optimizer**: Query optimization (predicate pushdown, constant folding, etc.)
- **Executor**: Executes plans against DataSource implementations
- **Type Checker**: Static type validation
- **Code Generator**: NEW - Compiles queries to Rust code
#### What HyperQL Should Own
- Query syntax and semantics
- Execution plan definitions
- Optimization rules
- DataSource trait (abstraction for backends)
- Geometric/vector operation types in execution plans
- Schema definitions (table structure, column types)
- Type system for entities, properties, values
#### What HyperQL Should NOT Own
- Actual execution of geometric operations (Hyperspatial's responsibility)
- Storage implementation details (Manifold's responsibility)
- Hyperbolic-specific logic (Hyperspatial's responsibility)
- Embedding generation (Tessera's responsibility)
**Dependencies:** None (general-purpose)
**Used by:** Hyperspatial (implements DataSource), compiled queries, user applications
#### Workspace Structure
```
hyperQL/
├── src/ # Core query language library
├── crates/
│ ├── hyperql-compile/ # NEW: Query compilation to Rust code
│ └── hyperql-treesitter/ # NEW: Tree-sitter grammar for editor integration
├── examples/
├── tests/
└── Cargo.toml
```
**New Crates:**
- **hyperql-compile**: Compiles HyperQL queries to type-safe Rust code
- **hyperql-treesitter**: Tree-sitter grammar for incremental parsing, syntax highlighting, and LSP integration
---
### Hyperspatial (Hyperbolic Space Database)
**Status:** Major redesign needed (70-75% of work here)
**Responsibility:** Multi-paradigm database unified through hyperbolic space
#### Core Value Propositions
1. **Hyperbolic Positioning**: Self-supervised learning of geometric positions
2. **Sub-millisecond Similarity**: HNSW index for fast k-NN queries
3. **Multi-Modal Unification**: Graph + vectors + properties in unified space
4. **Cascade System**: Schema-driven hierarchical aggregations
5. **Stream Coordination**: Ephemeral streams for data workflows
6. **Server Infrastructure**: HTTP/WebSocket API for clients
#### What Hyperspatial Should Own
**Hyperbolic Space Operations**
- Hyperbolic distance calculations (hyperboloid model)
- Multi-modal positioning (graph, embedding, property positions)
- Position learning algorithms (degree-based, self-supervised)
- Trajectory tracking and drift analysis
- HNSW index adapted for hyperbolic space
**Cascade System**
- Cascade dependency graphs
- Cascade trigger index
- Cascade engine for propagation
- Aggregation functions (SUM, AVG, LATEST, etc.)
- Decay and time-window logic
**Stream System** (Ephemeral Coordination)
- Stream producers (database triggers, external sources)
- Stream consumers (processors, database writers)
- Stream routing and message passing
- **NOT persistent storage** - streams are coordination, not persistence
- Loop-back pattern (stream → external processor → database update)
- Example use cases:
- Insert → stream → Tessera embedding → update entity
- Scheduled check → stream → web scraper → update if changed
- Database trigger → stream → external ML model → enrich data
**Server & Runtime**
- HTTP/WebSocket server (axum)
- Request coordination (async operations)
- Compute runtime (Lua and WASM)
- Custom function registry
- Connection management
**Import/Export**
- Data import from CSV, JSON, JSONL
- Batch operations
- Dependency resolution during import
- Position initialization strategies
**Indexing**
- **Hyperbolic HNSW**: HNSW adapted for hyperbolic distance (geometric similarity)
- **Vector Indexes**: HNSW/IVF-PQ for raw vector similarity (independent of hyperbolic positioning)
- **Graph Indexes**: Vertex-centric and label indexes for traversal optimization
- **Temporal Indexes**: Time-range indexes for time series queries
- **Global Position Index**: All entity positions for exact geometric queries
- **Type Indexes**: Entity type → entity list (in-memory DashMap)
- **Cache Layers**: Entity cache (50K), property cache (10K) with LRU eviction
**Integration Layer**
- DataSource implementation for HyperQL
- Geometric executor (hyperbolic distance operations)
- Vector executor (similarity operations with Tessera integration)
- Router for multi-collection access
- Schema engine for validation
#### What Hyperspatial Should NOT Own (Delegate to Manifold)
**Storage Primitives** - Use Manifold instead of custom implementations:
- ~~Custom vector storage with 5 redb databases~~ → manifold-vectors
- ~~Custom time series storage with columnar encoding~~ → manifold-timeseries
- ~~Custom graph storage with forward/reverse indexes~~ → manifold-graph
- ~~Complex edge properties system~~ → Simple bool + f32 from manifold-graph
- ~~Separate database files for core, properties, edges, etc.~~ → Column families
**Data to Remove**
- Problematic ML-based hyperbolic embedding code
- Redundant storage abstractions
- Overcomplicated edge property system
- Multiple database manager instances
#### Workspace Structure
```
hyperspatial/
├── src/ # Core hyperbolic database library
│ ├── hyperbolic/ # Positioning, distance, HNSW
│ ├── cascade/ # Cascade system
│ ├── streams/ # Stream coordination
│ ├── compute/ # Lua/WASM runtime (keep existing structure)
│ ├── coordination/ # Request coordination
│ ├── import/ # Import system
│ ├── query/ # HyperQL integration
│ └── lib.rs
├── crates/
│ ├── hyperspatial-cli/ # CLI tool with REPL, client, formatters
│ └── hyperspatial-server/ # Server binary (HTTP/WebSocket/gRPC)
├── examples/
├── tests/
└── Cargo.toml
```
**Crate Organization:**
- **hyperspatial-server**: Moved from `src/server/` to separate crate
- HTTP server (axum)
- WebSocket server
- gRPC server (keep existing implementation)
- Server configuration
- Request handlers
- **hyperspatial-cli**: Exists separately, already has comprehensive structure
- Commands: query, import, export, schema, describe, server, REPL
- Client: HTTP, WebSocket, protocol
- Formatters: JSON, CSV, table
- REPL: completion, highlighting, hints
**Dependencies:**
- Manifold (storage)
- HyperQL (queries)
- Tessera (optional, for native embedding generation)
---
### Haematite (Reactive Notebooks)
**Status:** Future work, not part of initial redesign
**Responsibility:** Mathematical thinking at Rust speed
Will integrate with the ecosystem once Hyperspatial redesign is complete.
---
## Architecture Diagrams
### Component Dependency Graph
```
┌─────────────┐
│ Haematite │ (Future - reactive notebooks)
└──────┬──────┘
│
├─────────────┬──────────────┬──────────────┐
│ │ │ │
v v v v
┌──────────┐ ┌─────────────┐ ┌─────────┐ ┌──────────┐
│ HyperQL │ │ Hyperspatial│ │ Tessera │ │ Manifold │
└──────────┘ └──────┬──────┘ └─────────┘ └──────────┘
│
├──────────┬──────────┐
│ │ │
v v v
┌──────────┐ ┌─────────┐ ┌──────────┐
│ HyperQL │ │ Tessera │ │ Manifold │
└──────────┘ └─────────┘ └──────────┘
Legend:
- Top layer depends on all below
- Hyperspatial uses all three primitives
- HyperQL, Tessera, Manifold are independent
```
### Storage Layer Consolidation
**Before (Current - Problematic):**
```
Collection "documents"
├── Shard 0000/
│ ├── core.redb (1 DB)
│ ├── properties.redb (1 DB)
│ ├── edges.redb (1 DB)
│ ├── edge_properties.redb (1 DB)
│ ├── measures.redb (1 DB)
│ ├── trajectories.redb (1 DB)
│ └── vectors/
│ ├── dense_vectors.redb (1 DB)
│ ├── sparse_vectors.redb (1 DB)
│ ├── multi_vectors.redb (1 DB)
│ ├── binary_vectors.redb (1 DB)
│ └── timeseries_vectors.redb (1 DB)
└── Total: 11 separate database files per shard
Issues:
- 11 file descriptors per shard
- Separate transaction logs
- No concurrent writes across types
- Complex coordination
```
**After (Redesign - Using Manifold):**
```
Collection "documents"
├── Shard 0000/
│ └── shard.manifold (1 ColumnFamilyDatabase)
│ ├── CF: entities (entity metadata)
│ ├── CF: properties (key-value properties)
│ ├── CF: edges (graph relationships)
│ ├── CF: measures (cascade values)
│ ├── CF: trajectories (position history)
│ ├── CF: semantic_vectors (VectorTable<768>)
│ ├── CF: code_vectors (VectorTable<512>)
│ ├── CF: sparse_vectors (SparseVectorTable)
│ └── CF: multi_vectors (MultiVectorTable)
└── Total: 1 database file with N column families
Benefits:
- 1 file descriptor per shard (+ N CF file descriptors)
- Shared WAL and transaction infrastructure
- Concurrent writes across column families
- Simplified coordination
- ~0.5ms commit latency via WAL
```
---
## Key Design Decisions
### 1. Entity IDs Across Tables
**Question:** Can the same entity ID be used across multiple vector tables?
**Answer:** Yes. Each VectorTable is an independent Manifold table with its own key space. Entity "doc_123" can exist in:
- `semantic_vectors` (VectorTable<768>)
- `code_vectors` (VectorTable<512>)
- `sparse_vectors` (SparseVectorTable)
When retrieving embeddings:
```rust
// Same entity ID, different tables
let semantic = semantic_vectors.get("doc_123")?;
let code = code_vectors.get("doc_123")?;
let sparse = sparse_vectors.get("doc_123")?;
```
### 2. Combining Time Series and Vectors (Trajectories)
**Question:** How do we store position history (17D vectors over time)?
**Answer:** Use manifold-timeseries with a custom value type:
```rust
#[derive(Serialize, Deserialize)]
struct PositionSnapshot {
graph_position: [f32; 17],
embedding_position: [f32; 17],
property_position: [f32; 17],
}
// TimeSeriesTable with composite key and custom value
let trajectories: TimeSeriesTable<(u64, &str), PositionSnapshot> =
TimeSeriesTable::new("trajectories")?;
// Store snapshot
trajectories.write(
(timestamp_nanos, entity_id),
PositionSnapshot { graph_position, embedding_position, property_position }
)?;
// Query range
for snapshot in trajectories.range(entity_id, start_time, end_time)? {
let (timestamp, positions) = snapshot?;
// All three positions available
}
```
### 3. Edge Properties Simplification
**Decision:** Adopt manifold-graph's simple model: `bool` (active/passive) + `f32` (weight/confidence)
**Rationale:**
- Covers 95% of use cases
- Simple and fast
- Aligns with Manifold's design
- For rare complex cases, store additional metadata in separate properties table
**Migration:**
- Remove complex edge property system from Hyperspatial
- Use GraphTable from manifold-graph
- Store edge metadata separately if needed
### 4. Compiled Queries vs. Hand-Written Functions
**Decision:** Compiled queries are for user-defined databases, not Hyperspatial internals
**Rationale:**
- Internal operations (HNSW updates, cascade propagation) are already in Rust
- No benefit to adding a compilation step for code we control
- Compiled queries valuable for users creating custom Manifold-based databases
- Example: Financial tick database with nanosecond-precision ingestion
**Use Cases for Compiled Queries:**
- Domain-specific databases (finance, IoT, analytics)
- Custom access patterns without writing Rust
- Type-safe query interfaces for applications
- Zero parsing overhead for known queries
### 5. Schema System
**Two-Mode Approach:**
**Mode 1: Inferred Schemas (Compiled Queries)**
- Queries imply schema through usage
- Type conflicts cause Rust compile errors
- Compiler optionally generates schema JSON
- No separate schema definition needed
**Mode 2: Explicit Schemas (Runtime Queries)**
- Define schema for validation
- HyperQL runtime enforces at query time
- Compiled queries validated at compile time
- Schema can be generated from compiled queries or defined explicitly
**Bridge:** Compiled queries can generate schema files that runtime queries consume
---
## Migration Strategy
### Phase 1: Preparation & Analysis
**Goals:**
1. Audit existing Hyperspatial code to identify what to keep vs. remove
2. Create component interface definitions
3. Design new module structure
4. Write comprehensive tests for core functionality to preserve
**Deliverables:**
- Interface definitions (traits, types)
- Test suite for hyperbolic operations
- Module map (old → new structure)
**Estimated Effort:** 1-2 weeks
---
### Phase 2: Manifold Storage Layer
**Goals:**
1. Replace all redb usage with Manifold column families
2. Implement storage adapters using manifold domain crates
3. Maintain API compatibility for upper layers
**Substeps:**
**2.1: Core Entity Storage**
- Replace CoreStorage with ColumnFamily for entities
- Implement in-memory type index (already exists, keep)
- Use EntityId as string key, serialize Entity as value
- Test: entity CRUD operations
**2.2: Properties Storage**
- Use ColumnFamily with composite key: `(entity_id, property_name)`
- Test: property operations, bulk updates
**2.3: Graph Storage**
- Replace edges.redb with manifold-graph GraphTable
- Simplify edge properties to bool + f32
- Test: edge CRUD, bidirectional queries
**2.4: Vector Storage**
- Create separate VectorTable instances per vector type/dimension
- Tables: semantic_vectors (VectorTable<768>), code_vectors (VectorTable<512>), etc.
- Use manifold-vectors instead of custom implementation
- Test: vector CRUD, bulk operations
**2.5: Time Series Storage**
- Use manifold-timeseries for measures
- Custom PositionSnapshot type for trajectories
- Test: time series queries, range scans
**2.6: Router Refactoring**
- Update Router to use ColumnFamilyDatabase
- Maintain Collection abstraction
- Update MetaIndex for routing
- Keep cache layers (entity cache, property cache)
- Test: multi-collection access, sharding
**Deliverables:**
- Complete storage layer using Manifold
- All tests passing
- Performance benchmarks showing equal or better performance
**Estimated Effort:** 4-6 weeks
---
### Phase 3: HyperQL Refactoring
**Goals:**
1. Make HyperQL general-purpose (remove Hyperspatial dependencies)
2. Move Hyperspatial-specific logic to Hyperspatial
3. Create hyperql-compile crate
**Substeps:**
**3.1: Core Query Language**
- Ensure parser, compiler, optimizer are backend-agnostic
- DataSource trait should work with any backend
- Remove Hyperspatial-specific assumptions
- Test: query parsing, compilation, optimization
**3.2: Type System Consolidation**
- Centralize entity, property, value types in HyperQL
- Hyperspatial imports and extends these types
- Test: type conversions, validation
**3.3: Execution Plan Extensions**
- Add geometric operation nodes to execution plans
- Add vector operation nodes
- Plans describe operations, don't execute them
- Test: plan generation, serialization
**3.4: HyperQL-Compile Crate**
- Create new crate in hyperQL/crates/hyperql-compile
- Implement type inference engine
- Implement Rust code generator using `quote` crate
- Generate functions using Manifold APIs
- Test: simple SELECT queries → Rust code
**Deliverables:**
- General-purpose HyperQL library
- hyperql-compile crate with basic code generation
- Clear separation from Hyperspatial
**Estimated Effort:** 3-4 weeks
---
### Phase 4: Hyperspatial Rebuild
**Goals:**
1. Rebuild Hyperspatial on Manifold + HyperQL foundation
2. Focus on core value: hyperbolic operations
3. Remove unnecessary complexity
4. Implement DataSource for HyperQL integration
**Substeps:**
**4.1: Hyperbolic Core**
- Keep: Distance calculations, position learning, trajectory analysis
- Remove: Problematic ML-based embedding code
- Update: Use manifold-vectors for position storage
- Test: Distance calculations, positioning algorithms
**4.2: HNSW Index**
- Adapt existing HNSW for Manifold storage
- Use manifold-vectors for position vectors
- Optimize for hyperbolic distance
- Test: k-NN queries, index building
**4.3: Cascade System**
- Keep: Dependency graphs, trigger index, engine
- Update: Use new storage layer
- Simplify: Remove unnecessary abstractions
- Test: Cascade propagation, aggregations
**4.4: Stream System**
- Implement ephemeral stream coordination
- Database triggers → streams
- External processors ↔ streams
- Loop-back to database
- Test: Stream workflows, producer/consumer patterns
**4.5: Server & Coordination**
- HTTP/WebSocket API using axum
- Request coordination for async ops
- Compute runtime (Lua/WASM)
- Test: API endpoints, concurrent requests
**4.6: HyperQL Integration**
- Implement DataSource trait using new storage
- Geometric executor for hyperbolic operations
- Vector executor with optional Tessera integration
- Test: HyperQL queries against Hyperspatial
**4.7: Import System**
- CSV, JSON, JSONL readers
- Batch operations using Manifold's bulk APIs
- Position initialization strategies
- Test: Large dataset imports
**Deliverables:**
- Clean, focused Hyperspatial implementation
- All core features working with new architecture
- Performance meeting or exceeding current system
**Estimated Effort:** 6-8 weeks
---
### Phase 5: Tessera Integration
**Goals:**
1. Add native embedding generation to HyperQL
2. Enable auto-embedding in Hyperspatial
3. Maintain Tessera as standalone (no changes to Tessera itself)
**Substeps:**
**5.1: HyperQL Embedding Functions**
- Add embedding functions to HyperQL: `embed_dense()`, `embed_multi()`, etc.
- Functions call Tessera when available
- Graceful degradation if Tessera not present
- Test: Embedding function parsing, compilation
**5.2: Hyperspatial Auto-Embedding**
- Configure Tessera models for collections
- Auto-embed on INSERT if configured
- Store embeddings using manifold-vectors
- Test: Auto-embedding workflows
**5.3: Compiled Query Integration**
- Generate code that calls Tessera APIs
- Type-safe embedding generation in compiled queries
- Test: Compiled queries with embeddings
**Deliverables:**
- Seamless Tessera integration
- Native embedding generation in queries
- Optional feature flag for Tessera dependency
**Estimated Effort:** 2-3 weeks
---
### Phase 6: Testing & Optimization
**Goals:**
1. Comprehensive test coverage
2. Performance optimization
3. Documentation
4. Examples
**Substeps:**
**6.1: Testing**
- Unit tests for all modules
- Integration tests for workflows
- Performance benchmarks
- Stress tests
- Test: All functionality under load
**6.2: Performance**
- Profile hot paths
- Optimize HNSW operations
- Tune cache sizes
- Optimize stream throughput
- Benchmark: Compare to current system
**6.3: Documentation**
- Architecture documentation
- API documentation
- Query language guide
- Migration guide for users
- Examples for common patterns
**Deliverables:**
- Production-ready system
- Complete documentation
- Performance validated
**Estimated Effort:** 3-4 weeks
---
## Implementation Phases Timeline
| Phase 1: Preparation | 1-2 weeks | None | Interface definitions, test suite, module map |
| Phase 2: Manifold Storage | 4-6 weeks | Phase 1 | Complete storage layer on Manifold |
| Phase 3: HyperQL Refactor | 3-4 weeks | Phase 1 | General-purpose HyperQL + compile crate |
| Phase 4: Hyperspatial Rebuild | 6-8 weeks | Phases 2, 3 | Clean Hyperspatial implementation |
| Phase 5: Tessera Integration | 2-3 weeks | Phase 4 | Native embedding generation |
| Phase 6: Testing & Optimization | 3-4 weeks | Phases 2-5 | Production-ready system |
**Total Estimated Duration:** 19-27 weeks (4.5-6 months)
---
### What Gets Deleted or Reorganized
### From Hyperspatial
**Storage Implementations (Replace with Manifold):**
- `src/persistence/storage/core.rs` → Use ColumnFamily with UUID keys
- `src/persistence/storage/properties.rs` → Use ColumnFamily with composite (UUID, property_name) keys
- `src/persistence/storage/edges.rs` → Use manifold-graph with UUID vertices
- `src/persistence/storage/edge_properties.rs` → Simplify to bool + f32 in manifold-graph
- `src/persistence/storage/measures.rs` → Use manifold-timeseries
- `src/persistence/storage/trajectories.rs` → Use manifold-timeseries with custom PositionSnapshot type
- `src/persistence/vectors/storage/*` → Use manifold-vectors with separate tables per vector type
- `src/persistence/timeseries/storage/*` → Use manifold-timeseries
**Server Code (Move to hyperspatial-server crate):**
- `src/server/mod.rs`
- `src/server/http.rs`
- `src/server/websocket.rs`
- `src/server/grpc.rs` (keep gRPC implementation)
- `src/server/config.rs`
- `src/server/generated/*` (gRPC generated code)
**Problematic ML Code:**
- `src/hyperbolic/learning/embeddings.rs` (if problematic)
- `src/hyperbolic/forces/*` (if using ML-based positioning)
- Any gradient descent or ML training code that didn't work
**Redundant Abstractions:**
- Multiple DatabaseManager instances
- Complex edge property serialization
- Custom vector compression (use Manifold's)
- Duplicate type definitions between Hyperspatial and HyperQL
### From HyperQL
**Hyperspatial-Specific Code:**
- Hard-coded references to Hyperspatial types
- Execution logic that belongs in Hyperspatial
- Any storage-specific assumptions
---
## What Gets Kept
### From Hyperspatial
**Core Value (Hyperbolic Space):**
- `src/hyperbolic/geometry/*` - Distance calculations, coordinate systems
- `src/hyperbolic/hyperboloid/*` - Hyperboloid model operations
- `src/hyperbolic/trajectory/*` - Drift, velocity, acceleration analysis
- `src/hyperbolic/properties/*` - Property fingerprinting and similarity
- `src/persistence/indices/hnsw.rs` - HNSW index for hyperbolic space
- `src/persistence/indices/global.rs` - Global position index
**Cascade System:**
- `src/cascade/*` - Complete cascade system
**Stream System:**
- `src/streams/*` - Complete stream coordination system
**Compute Runtime (Keep existing structure, audit for cleanup):**
- `src/compute/mod.rs` - Compute layer coordination
- `src/compute/lua/` - Lua runtime (6 files, well-organized)
- `mod.rs` - Module definition
- `runtime.rs` - Lua VM initialization
- `sandbox.rs` - Sandboxing and security
- `api_database.rs` - Database API bindings
- `api_hyperbolic.rs` - Hyperbolic operations API
- `functions.rs` - Custom function registry
- `src/compute/wasm/` - WASM runtime (5 files, well-organized)
- `mod.rs` - Module definition
- `runtime.rs` - WASM VM initialization
- `memory.rs` - Memory management
- `modules.rs` - Module loading
- `host_functions.rs` - Host function bindings
**Coordination:**
- `src/coordination/*` - Request coordination (keep as-is)
**Import:**
- `src/import/*` - Import system (update to use Manifold APIs)
**Caching:**
- Entity cache, property cache (in Router)
- Type index (in-memory DashMap)
### From HyperQL
**All Core Components:**
- Parser, AST, compiler, optimizer
- Type checker, validator
- Execution plan definitions
- DataSource trait
---
## Success Criteria
### Functional Requirements
- [ ] All current Hyperspatial features working with new architecture
- [ ] HyperQL queries execute correctly
- [ ] Compiled queries generate valid Rust code
- [ ] Cascade system propagates correctly
- [ ] Stream system coordinates workflows
- [ ] Import system handles large datasets
- [ ] HNSW index provides sub-millisecond queries
### Performance Requirements
- [ ] Query latency ≤ current system (22-70μs for k-NN)
- [ ] WAL commits ~0.5ms (10x improvement over current)
- [ ] Cache hit rates ≥ 90% after warmup
- [ ] Import throughput ≥ current system
- [ ] Memory usage ≤ 2x current system
### Code Quality
- [ ] Module boundaries clear and enforced
- [ ] No circular dependencies
- [ ] Test coverage > 80%
- [ ] All public APIs documented
- [ ] Examples for common patterns
### Architecture
- [ ] Components independently usable
- [ ] Clean dependency graph (no cycles)
- [ ] Manifold used throughout (no redb)
- [ ] HyperQL general-purpose (no Hyperspatial deps)
- [ ] Hyperspatial focused on hyperbolic operations
---
## Risk Analysis
### Technical Risks
**Risk 1: Performance Regression**
- **Mitigation:** Comprehensive benchmarking at each phase
- **Fallback:** Optimize hot paths, tune Manifold configuration
**Risk 2: Unforeseen Manifold Limitations**
- **Mitigation:** Early proof-of-concept for critical operations
- **Fallback:** Extend Manifold if needed, or workaround
**Risk 3: Complex Migration Logic**
- **Mitigation:** Greenfield approach (no migration needed)
- **Fallback:** N/A (no production data)
**Risk 4: Timeline Underestimation**
- **Mitigation:** Phased approach with clear milestones
- **Fallback:** Adjust scope or timeline based on progress
### Resource Risks
**Risk 1: Single Developer**
- **Mitigation:** Clear documentation, frequent commits
- **Fallback:** Bring in help if needed
**Risk 2: Scope Creep**
- **Mitigation:** Strict adherence to design document
- **Fallback:** Defer non-essential features to future versions
---
## Open Questions
1. **Vector Table Naming:** How to dynamically create/manage named vector tables?
- **Proposal:** Router maintains registry of vector table names → VectorTable instances
2. **Schema Storage:** Where to persist schema definitions?
- **Proposal:** Use Manifold column family for schema metadata
3. **HNSW Serialization:** How to persist HNSW index efficiently?
- **Current:** Custom format, works well
- **Proposal:** Keep current approach, store in Manifold column family
4. **Compiled Query Distribution:** How do users distribute compiled queries?
- **Proposal:** Generate Rust crate that users add as dependency
5. **Haematite Integration:** When and how to integrate?
- **Proposal:** Phase 7 (future work), after Hyperspatial stable
---
## Appendix A: Module Structure
### Hyperspatial Module Map (Actual Implementation - Phase 2)
**Status**: Phase 2A Complete (Week 1) - Storage scaffolding done
**Branch**: storage-rewrite
```
hyperspatial/
├── src/
│ ├── lib.rs # Public API
│ ├── error.rs # Error types
│ │
│ ├── types/ # Core domain types (UNCHANGED)
│ │ ├── mod.rs # Type exports
│ │ ├── entity.rs # Entity, EntityId, Edge, PropertyValue, MeasureValue
│ │ └── value.rs # Value enum for schema-aware storage
│ │
│ ├── storage/ # NEW: Manifold-based storage layer
│ │ ├── mod.rs # Module root (165 lines)
│ │ ├── collection.rs # ManifoldCollection wrapper (380 lines)
│ │ ├── config.rs # StorageConfig with presets (282 lines)
│ │ ├── error.rs # Storage-specific errors (104 lines)
│ │ ├── entities.rs # Entity CRUD (TODO: Week 2)
│ │ ├── properties.rs # Property CRUD (TODO: Week 2)
│ │ ├── graph.rs # manifold-graph wrapper (TODO: Week 3)
│ │ ├── vectors.rs # manifold-vectors wrapper (TODO: Week 4)
│ │ ├── timeseries.rs # manifold-timeseries wrapper (TODO: Week 5)
│ │ ├── measures.rs # Cascade value storage (TODO: Week 5)
│ │ ├── index.rs # Master routing index (TODO: Week 2)
│ │ └── utils.rs # Helper functions
│ │
│ ├── _old_persistence/ # REFERENCE: Old redb-based implementation
│ │ └── [100+ files moved here - 528 redb API references]
│ │
│ ├── hyperbolic/ # Core hyperbolic operations (KEEP)
│ │ ├── mod.rs
│ │ ├── geometry/ # Geometric operations
│ │ │ ├── mod.rs
│ │ │ ├── distance.rs # Distance calculations
│ │ │ ├── coordinates.rs # Coordinate systems
│ │ │ └── manifold.rs # Manifold operations
│ │ ├── hyperboloid/ # Hyperboloid model
│ │ │ ├── mod.rs
│ │ │ ├── operations.rs # Core operations
│ │ │ └── visualization.rs # Visualization helpers
│ │ ├── trajectory/ # Trajectory analysis
│ │ │ ├── mod.rs
│ │ │ ├── drift.rs # Drift detection
│ │ │ ├── velocity.rs # Velocity calculations
│ │ │ └── acceleration.rs # Acceleration analysis
│ │ └── properties/ # Property-based operations
│ │ ├── mod.rs
│ │ ├── fingerprint.rs # Property fingerprinting
│ │ └── similarity.rs # Property similarity
│ │
│ ├── index/ # Indexing systems
│ │ ├── mod.rs
│ │ ├── hyperbolic_hnsw.rs # HNSW for hyperbolic positions
│ │ ├── vector_hnsw.rs # HNSW for raw vectors (new)
│ │ ├── graph_index.rs # Graph traversal indexes (new)
│ │ ├── temporal_index.rs # Time series indexes (new)
│ │ ├── global.rs # Global position index
│ │ └── type_index.rs # Entity type index (DashMap)
│ │
│ ├── cascade/ # Cascade system (keep existing)
│ │ ├── mod.rs
│ │ ├── engine_core.rs # Cascade engine
│ │ ├── dependencies.rs # Dependency graph
│ │ ├── triggers.rs # Trigger index
│ │ ├── aggregation.rs # Aggregation functions
│ │ └── decay.rs # Decay functions
│ │
│ ├── streams/ # Stream coordination (keep existing structure)
│ │ ├── mod.rs
│ │ ├── core/ # Core stream logic
│ │ ├── producers/ # Producer implementations
│ │ ├── consumers/ # Consumer implementations
│ │ ├── processors/ # Stream processors
│ │ └── triggers/ # Trigger system
│ │
│ ├── query/ # HyperQL integration (KEEP)
│ │ ├── mod.rs
│ │ ├── hyperql/ # HyperQL-specific
│ │ │ ├── datasource.rs # DataSource implementation
│ │ │ ├── geometric.rs # Geometric executor
│ │ │ ├── vector.rs # Vector executor
│ │ │ └── conversion.rs # Type conversions
│ │ ├── api/ # Query API
│ │ ├── compiled/ # Compiled query support
│ │ └── runtime/ # Runtime query support
│ │
│ ├── compute/ # Custom functions (keep existing structure)
│ │ ├── mod.rs
│ │ ├── lua/ # Lua runtime (6 files)
│ │ │ ├── mod.rs
│ │ │ ├── runtime.rs
│ │ │ ├── sandbox.rs
│ │ │ ├── api_database.rs
│ │ │ ├── api_hyperbolic.rs
│ │ │ └── functions.rs
│ │ └── wasm/ # WASM runtime (5 files)
│ │ ├── mod.rs
│ │ ├── runtime.rs
│ │ ├── memory.rs
│ │ ├── modules.rs
│ │ └── host_functions.rs
│ │
│ ├── coordination/ # Async coordination (keep as-is)
│ │ ├── mod.rs
│ │ └── request_coordinator.rs
│ │
│ └── import/ # Import system (keep existing)
│ ├── mod.rs
│ ├── config_parser.rs
│ ├── executor.rs
│ ├── readers/
│ │ ├── csv_reader.rs
│ │ ├── json_reader.rs
│ │ └── jsonl_reader.rs
│ └── position_init.rs
│
└── crates/
├── hyperspatial-server/ # Server binary (NEW - moved from src/server)
│ ├── src/
│ │ ├── main.rs # Server entrypoint
│ │ ├── http.rs # HTTP API
│ │ ├── websocket.rs # WebSocket API
│ │ ├── grpc.rs # gRPC API
│ │ ├── config.rs # Server configuration
│ │ └── handlers.rs # Request handlers
│ └── Cargo.toml
│
└── hyperspatial-cli/ # CLI tool (EXISTS - comprehensive)
├── src/
│ ├── main.rs
│ ├── client/ # HTTP/WebSocket/Protocol
│ ├── commands/ # All CLI commands
│ │ ├── query.rs
│ │ ├── import/
│ │ ├── export.rs
│ │ ├── schema/
│ │ ├── describe.rs
│ │ ├── server.rs
│ │ └── repl.rs
│ ├── config/ # Configuration
│ ├── formatter/ # Output formatters
│ └── repl/ # REPL implementation
└── Cargo.toml
```
**Key Changes from Original Plan**:
1. **Storage Module**: Actual implementation uses clean separation:
- Domain types stay in `types/` (Entity, PropertyValue, etc.)
- Storage-specific code in `storage/` (errors, config, operations)
- No duplication between modules
2. **Old Code Preserved**: `_old_persistence/` contains all 528 redb references
as reference material (not to be copied)
3. **Configuration System**: `StorageConfig` with 5 presets (Default, Small,
Large, ReadHeavy, WriteHeavy) and builder pattern
4. **Module Organization**: Clear TODO markers for each week of Phase 2:
- Week 2: entities.rs, properties.rs, index.rs
- Week 3: graph.rs (manifold-graph integration)
- Week 4: vectors.rs (manifold-vectors integration)
- Week 5: timeseries.rs, measures.rs (manifold-timeseries integration)
- Week 6: Router integration, testing
**Reference Documents**:
- Design: `docs/NEW_STORAGE_ARCHITECTURE.md` (737 lines)
- Implementation plan: `.project/phases/phase-2/REWRITE_APPROACH.md` (315 lines)
- Progress tracking: `.project/phases/phase-2/WEEK1_PROGRESS.md` (281 lines)
```
### HyperQL Module Map (Refactored)
```
hyperQL/
├── src/
│ ├── lib.rs # Public API
│ ├── error.rs # Error types
│ │
│ ├── ast/ # Abstract syntax tree
│ │ ├── mod.rs
│ │ ├── query.rs # Query structures
│ │ ├── expression.rs # Expressions
│ │ ├── literal.rs # Literals
│ │ ├── function.rs # Functions
│ │ ├── predicate.rs # Predicates
│ │ ├── geometric.rs # Geometric operations
│ │ ├── vector.rs # Vector operations
│ │ ├── graph.rs # Graph patterns
│ │ └── schema.rs # Schema definitions
│ │
│ ├── parser/ # Parser
│ │ ├── mod.rs
│ │ ├── lexer.rs # Tokenization
│ │ ├── grammar.rs # Grammar rules
│ │ └── combinators.rs # Parser combinators
│ │
│ ├── compiler/ # Compiler
│ │ ├── mod.rs
│ │ ├── planner.rs # Query planning
│ │ ├── expression.rs # Expression compilation
│ │ └── select.rs # SELECT compilation
│ │
│ ├── optimizer/ # Optimizer
│ │ ├── mod.rs
│ │ ├── rules.rs # Optimization rules
│ │ ├── predicate_pushdown.rs
│ │ ├── constant_folding.rs
│ │ └── projection_pushdown.rs
│ │
│ ├── executor/ # Executor
│ │ ├── mod.rs
│ │ ├── plan_executor.rs # Plan execution
│ │ ├── datasource.rs # DataSource trait
│ │ └── expression_eval.rs # Expression evaluation
│ │
│ ├── types/ # Type system
│ │ ├── mod.rs
│ │ ├── entity.rs # Entity types
│ │ ├── property.rs # Property types
│ │ └── value.rs # Value types
│ │
│ └── validator/ # Validation
│ ├── mod.rs
│ ├── type_checker.rs # Type checking
│ └── semantic.rs # Semantic validation
│
└── crates/
└── hyperql-compile/ # Query compiler
├── src/
│ ├── lib.rs # Library API
│ ├── cli.rs # CLI tool
│ ├── inference.rs # Type inference
│ ├── codegen.rs # Code generation
│ └── schema.rs # Schema generation
└── Cargo.toml
```
---
## Appendix B: Example Code
**Note:** Detailed code examples have been moved to a separate document `HYPERSPATIAL_EXAMPLES.md` to keep this design document focused. See that document for:
- Vector storage with multiple named vectors and UUID keys
- Trajectory storage with custom PositionSnapshot type
- Compiled query examples with generated Rust code
- Fixed-width key usage patterns
---
## Appendix C: Performance Targets
### Current Performance Baseline
| Metric | Current | Target (Post-Redesign) |
|--------|---------|------------------------|
| k-NN query (HNSW) | 22-70μs | ≤ 70μs (maintain) |
| SQL query (entity retrieval) | 145-300μs | ≤ 300μs (maintain) |
| Commit latency | ~5ms (redb) | ~0.5ms (Manifold WAL) |
| Cache hit rate | 90-95% | ≥ 90% (maintain) |
| Entity cache size | 50K capacity, 9.5MB | Maintain |
| Property cache size | 10K capacity, 19.1MB | Maintain |
| Import throughput | ~10K entities/sec | ≥ 10K entities/sec |
| File handles per shard | 11 | ~1 + N CFs |
### New Performance Opportunities
- **WAL Commits:** 10x faster (0.5ms vs 5ms)
- **Concurrent Writes:** Column families enable parallel writes to different data types
- **Vector Operations:** manifold-vectors optimizations (zero-copy, const generic dimensions)
- **Time Series:** manifold-timeseries delta encoding and compression
- **Graph Traversal:** manifold-graph bidirectional indexes
- **Compiled Queries:** Zero parsing overhead for known queries
---
## Coding Conventions
To maintain consistency with Manifold's architecture and ensure clean, maintainable code:
### Architectural Standards
- **Follow Manifold's patterns**: Study manifold-vectors, manifold-timeseries, manifold-graph for structure
- **Separation of concerns**: One module, one responsibility
- **Folder modules**: Use `mod.rs` files for organization and visibility control only
- `mod.rs` contains documentation and `pub mod` declarations
- `mod.rs` should NOT contain implementation code
- Implementation goes in separate files (e.g., `distance.rs`, `trajectory.rs`)
- **Fixed-width keys**: Use UUID (16 bytes) for all entity IDs, enables zero-copy deserialization
### File Organization
- **Keep files under 600 lines**: Ideal target, 800 lines maximum
- **Split large modules**: Break into logical sub-modules when exceeding limits
- **Clear naming**: File names should describe their single responsibility
- **Related code together**: Group related functions/types in same file
### Development Process
- **Dev notes in plan**: Record decisions, challenges, solutions in phase plan documents
- **Mark progress**: Update task lists and journals in `.project/` directory
- **Test as you go**: Write tests alongside implementation
- **Benchmark critical paths**: Measure performance for storage and indexing operations
### Code Quality
- **Documentation**: Public APIs must have doc comments
- **Error handling**: Use Result types, provide context in errors
- **Type safety**: Leverage Rust's type system (const generics, traits, etc.)
- **No unsafe**: Avoid unsafe code unless absolutely necessary and well-justified
### Module Structure Example
```
// Good - mod.rs for organization
pub mod distance;
pub mod positioning;
pub mod trajectory;
use distance::HyperbolicDistance;
pub use positioning::PositionLearner;
// Bad - mod.rs with implementation
pub mod distance;
pub fn calculate_hyperbolic_distance(...) { // Don't do this!
// Implementation here
}
```
---
## Next Steps
1. **Review & Approval:** Review this design document with stakeholders
2. **Branch Creation:** Create `redesign` branch from `main`
3. **Create Phase Plans:** Break each phase into detailed task documents
4. **Phase 1 Start:** Begin preparation phase
- Audit existing code (detailed review of what to keep/modify/delete)
- Write interface definitions
- Create test suite
5. **Weekly Progress:** Track progress against timeline
6. **Milestone Reviews:** Review at end of each phase
7. **Dev Journals**: Maintain running notes in `.project/phases/` directory
---
## Conclusion
This redesign transforms Hyperspatial from a monolithic system with complex, duplicated storage code into a modular ecosystem with clear component boundaries. By consolidating on Manifold for storage, making HyperQL general-purpose, and focusing Hyperspatial on its core value (hyperbolic space operations), we create a platform that is:
- **Simpler:** Less code to maintain, cleaner abstractions
- **Faster:** Manifold's WAL, column families, and domain optimizations
- **More Composable:** Each component independently useful
- **More Maintainable:** Clear boundaries, no circular dependencies
- **More Powerful:** Compiled queries, native embeddings, reactive notebooks
The estimated 4.5-6 month timeline is aggressive but achievable with focused effort and the greenfield advantage of no migration constraints.
**Status:** Ready for implementation
**Next Action:** Begin Phase 1 (Preparation & Analysis)