dataspool-rs 0.3.0


DataSpool - Efficient Data Bundling System

DataSpool is a high-performance data bundling library that eliminates filesystem overhead by concatenating multiple items (cards, images, binary blobs) into a single indexed .spool file with SQLite-based metadata and vector embeddings.

Features

  • πŸ“¦ Efficient Bundling - Single file storage with byte-offset index
  • πŸš€ Random Access - Direct seeks to any item without scanning
  • πŸ” Vector Search - SQLite-backed embeddings for semantic retrieval
  • πŸ“Š Metadata Storage - Rich metadata with full-text search (FTS5)
  • πŸ”„ Multiple Variants - Cards (compressed CML), images, binary blobs
  • πŸ’Ύ Compact Format - Minimal overhead, optimal for thousands of items
  • πŸ” Type-Safe - Rust type safety with serde serialization

Quick Start

Writing a Spool

use dataspool::{SpoolBuilder, SpoolEntry};

// Create spool builder
let mut builder = SpoolBuilder::new();

// Add entries
builder.add_entry(SpoolEntry {
    id: "item1".to_string(),
    data: b"Item 1 data".to_vec(),
});

builder.add_entry(SpoolEntry {
    id: "item2".to_string(),
    data: b"Item 2 data".to_vec(),
});

// Write to file
builder.write_to_file("data.spool")?;

Reading from a Spool

use dataspool::SpoolReader;

// Open spool
let reader = SpoolReader::open("data.spool")?;

// Read specific entry
let data = reader.read_entry(0)?; // Read first entry
println!("Item 0: {} bytes", data.len());

// Iterate entries
for (index, entry) in reader.iter_entries().enumerate() {
    let data = entry?;
    println!("Item {}: {} bytes", index, data.len());
}

Reading an Embedded Spool

Spools can be embedded within larger files (e.g., an Engram archive) and read directly without extraction. open_embedded() takes a base byte offset and adjusts all internal offsets so that read_card() seeks to the correct position within the host file:

use dataspool::SpoolReader;

// Open a spool stitched into a larger file at byte offset 4096.
let mut reader = SpoolReader::open_embedded("archive.eng", 4096)?;

// read_card() transparently seeks within the host file.
let card = reader.read_card(0)?;
println!("Card 0: {} bytes", card.len());

This enables consumers like Engram to stitch spool data inline during archive compilation, then serve card reads directly from the archive file β€” no temp extraction, no filesystem overhead.

Persistent Vector Store

use dataspool::{PersistentVectorStore, DocumentRef};

// Create persistent store
let mut store = PersistentVectorStore::new("vectors.db")?;

// Add document with embedding
let doc_ref = DocumentRef {
    id: "doc1".to_string(),
    file_path: "data.spool".to_string(),
    source: "web-scrape".to_string(),
    metadata: Some(r#"{"title": "Example"}"#.to_string()),
    spool_offset: Some(0),
    spool_length: Some(1024),
};

let embedding = vec![0.1, 0.2, 0.3, 0.4]; // Example embedding vector
store.add_document_ref(&doc_ref, &embedding)?;

// Search by vector similarity
let query_vector = vec![0.15, 0.25, 0.35, 0.45];
let results = store.search(&query_vector, 10)?;

for result in results {
    println!("ID: {}, Score: {:.3}", result.id, result.score);
}
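The performance notes below attribute scoring to cosine similarity. A minimal, self-contained sketch of that computation (not the crate's actual implementation, which may differ in detail):

```rust
/// Cosine similarity between two embedding vectors: dot(a, b) / (|a| * |b|).
/// A sketch only; the crate's internal scoring may differ.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0 // define similarity with a zero vector as 0
    } else {
        dot / (norm_a * norm_b)
    }
}
```

Identical vectors score 1.0 and orthogonal vectors score 0.0, so higher scores mean closer matches.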

Spool Format

File Structure

.spool file:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Magic: "SP01"       (4 bytes)β”‚
β”‚ Version: 1          (1 byte) β”‚
β”‚ Card Count          (4 bytes)β”‚
β”‚ Index Offset        (8 bytes)β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Card 0 Data                  β”‚
β”‚ Card 1 Data                  β”‚
β”‚ ...                          β”‚
β”‚ Card N Data                  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Index:                       β”‚
β”‚   [offset0, len0]            β”‚
β”‚   [offset1, len1]            β”‚
β”‚   ...                        β”‚
β”‚   [offsetN, lenN]            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

.db file (SQLite):
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ documents table:             β”‚
β”‚   - id                       β”‚
β”‚   - file_path                β”‚
β”‚   - source                   β”‚
β”‚   - metadata (JSON)          β”‚
β”‚   - spool_offset             β”‚
β”‚   - spool_length             β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ embeddings table:            β”‚
β”‚   - doc_id                   β”‚
β”‚   - vector (BLOB)            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Format Details

  • Magic Number: SP01 (4 bytes) - Identifies spool format
  • Version: 1 (1 byte) - Format version
  • Card Count: Number of entries in spool (u32)
  • Index Offset: Byte offset where index starts (u64)
  • Index: Array of [offset: u64, length: u32] pairs (12 bytes each)

Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  DataCard   β”‚ (compressed CML)
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       v
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ SpoolBuilder│────>β”‚  .spool file β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                           β”‚
              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
              v            v                v
       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
       β”‚ Standalone β”‚ β”‚ Embedded β”‚  β”‚  .db (SQLite) β”‚
       β”‚ SpoolReaderβ”‚ β”‚ in .eng  β”‚  β”‚  - documents  β”‚
       β”‚  ::open()  β”‚ β”‚::open_   β”‚  β”‚  - embeddings β”‚
       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚embedded()β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
                      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜          β”‚
                                            v
                                   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                   β”‚ PersistentVector β”‚
                                   β”‚      Store       β”‚
                                   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Standalone vs. Embedded

Mode        Constructor                                 Use Case
----------  ------------------------------------------  ----------------------------------------------------------------
Standalone  SpoolReader::open(path)                     Reading .spool files directly from disk
Embedded    SpoolReader::open_embedded(path, offset)    Reading spools stitched into a host file (e.g., Engram archives)

Both modes share the same read_card() / read_card_at() API. The embedded constructor adjusts internal byte offsets by the base offset so all seeks target the correct position within the host file.
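The adjustment itself is simple arithmetic. A hypothetical sketch (not the crate's actual internals): each index entry stores an offset relative to the spool's own start, and embedded mode adds the spool's base offset within the host file before seeking:

```rust
/// Hypothetical sketch of the embedded-mode offset adjustment.
struct IndexEntry {
    offset: u64, // relative to the start of the spool data
    length: u32, // entry length in bytes
}

/// In standalone mode base_offset is 0; in embedded mode it is the
/// byte position of the spool within the host file.
fn absolute_seek_position(base_offset: u64, entry: &IndexEntry) -> u64 {
    base_offset + entry.offset
}
```

Because only the base changes, the same read path serves both modes.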

Use Cases

1. Knowledge Base Archival

Bundle thousands of documentation cards into a single file:

// Build spool from cards
let mut builder = SpoolBuilder::new();
for card in documentation_cards {
    builder.add_entry(SpoolEntry {
        id: card.id,
        data: card.compressed_data,
    });
}
builder.write_to_file("rust-stdlib.spool")?;

// Create vector index
let mut store = PersistentVectorStore::new("rust-stdlib.db")?;
for (i, embedding) in embeddings.iter().enumerate() {
    store.add_document_ref(&DocumentRef {
        id: format!("card_{}", i),
        file_path: "rust-stdlib.spool".to_string(),
        spool_offset: Some(offsets[i]),
        spool_length: Some(lengths[i]),
        // ...source and metadata as in the earlier example
    }, embedding)?;
}

2. Image Dataset Storage

Store image collections with metadata:

let mut builder = SpoolBuilder::new();

for image_path in image_paths {
    let data = std::fs::read(&image_path)?;
    builder.add_entry(SpoolEntry {
        id: image_path.file_stem().unwrap().to_string_lossy().into_owned(),
        data,
    });
}

builder.write_to_file("images.spool")?;

3. Binary Blob Archival

Archive arbitrary binary data with fast random access:

// Write blobs
let mut builder = SpoolBuilder::new();
builder.add_entry(SpoolEntry { id: "blob1".into(), data: blob1 });
builder.add_entry(SpoolEntry { id: "blob2".into(), data: blob2 });
builder.write_to_file("blobs.spool")?;

// Random access read
let reader = SpoolReader::open("blobs.spool")?;
let blob1_data = reader.read_entry(0)?; // Direct access, no scan

Performance

Benchmark results (3,309 items, Rust stdlib documentation):

Operation                   Time    Notes
--------------------------  ------  --------------------------
Build spool                 ~200ms  Writing all items + index
Read single item            <1ms    Direct byte offset seek
Read all items              ~50ms   Sequential read
SQLite insert (1 doc)       ~0.5ms  With embedding
Vector search (10 results)  ~5ms    Cosine similarity + index

Comparison to Alternatives

Approach          Read Speed           Storage Overhead  Random Access
----------------  -------------------  ----------------  -------------
Individual files  Slow (3,309 inodes)  High (4KB/file)   Yes
tar archive       Slow (must scan)     Low               No
zip archive       Fast                 Medium            Yes
DataSpool         Fast                 Minimal           Yes

DataSpool Advantages

  • No compression overhead - Items pre-compressed by BytePunch
  • Instant random access - Direct byte offset, no central directory scan
  • Integrated vector DB - Semantic search without external tools
  • Minimal format - Simple binary format, easy to parse

Dependencies

[dependencies]
dataspool = "0.3.0"
bytepunch = "0.1.0"  # For decompressing compressed items

Dependency Graph

dataspool
β”œβ”€β”€ bytepunch (compression)
β”œβ”€β”€ rusqlite (SQLite database)
β”œβ”€β”€ serde (serialization)
└── thiserror (error handling)

Features

Default

Basic spool read/write and persistent vector store.

Optional: async

Async APIs for non-blocking I/O:

[dependencies]
dataspool = { version = "0.3.0", features = ["async"] }

use dataspool::async_api::AsyncSpoolReader;

let reader = AsyncSpoolReader::open("data.spool").await?;
let data = reader.read_entry(0).await?;

Installation

Add to Cargo.toml:

[dependencies]
dataspool = "0.3.0"

Or with async support:

[dependencies]
dataspool = { version = "0.3.0", features = ["async"] }

Testing

# Run all tests
cargo test

# Run with logging
RUST_LOG=debug cargo test

# Test specific module
cargo test spool
cargo test persistent_store

Examples

See examples/ directory:

  • build_spool.rs - Build a spool from files
  • read_spool.rs - Read entries from a spool
  • vector_search.rs - Semantic search with embeddings

Run with:

cargo run --example build_spool

cargo run --example read_spool

cargo run --example vector_search

Roadmap

  • Image-based spools with EXIF metadata
  • Audio/video spool variants
  • Compression statistics per entry
  • Incremental spool updates (append-only mode)
  • Multi-threaded indexing
  • Memory-mapped I/O for large spools
  • Network streaming protocol

History

Extracted from the SAM (Societal Advisory Module) project, where it provides the spool bundling system for knowledge base archival.

License

MIT - See LICENSE for details.

Author

Magnus Trent magnus@blackfall.dev

Links