# DataSpool - Efficient Data Bundling System

DataSpool is a high-performance data bundling library that eliminates filesystem overhead by concatenating multiple items (cards, images, binary blobs) into a single indexed `.spool` file with SQLite-based metadata and vector embeddings.
## Features

- **Efficient Bundling** - Single file storage with byte-offset index
- **Random Access** - Direct seeks to any item without scanning
- **Vector Search** - SQLite-backed embeddings for semantic retrieval
- **Metadata Storage** - Rich metadata with full-text search (FTS5)
- **Multiple Variants** - Cards (compressed CML), images, binary blobs
- **Compact Format** - Minimal overhead, optimal for thousands of items
- **Type-Safe** - Rust type safety with serde serialization
## Quick Start

### Writing a Spool

```rust
use dataspool::SpoolBuilder;

// Create spool builder
let mut builder = SpoolBuilder::new();

// Add entries
builder.add_entry(b"first card".to_vec());
builder.add_entry(b"second card".to_vec());

// Write to file
builder.write_to_file("cards.spool")?;
```
### Reading from a Spool

```rust
use dataspool::SpoolReader;

// Open spool
let reader = SpoolReader::open("cards.spool")?;

// Read specific entry
let data = reader.read_entry(0)?; // Read first entry
println!("entry 0: {} bytes", data.len());

// Iterate entries
for (i, entry) in reader.iter_entries().enumerate() {
    println!("entry {i}: {} bytes", entry.len());
}
```
### Reading an Embedded Spool

Spools can be embedded within larger files (e.g., an Engram archive) and read directly without extraction. `open_embedded()` takes a base byte offset and adjusts all internal offsets so that `read_card()` seeks to the correct position within the host file:

```rust
use dataspool::SpoolReader;

// Open a spool stitched into a larger file at byte offset 4096.
let mut reader = SpoolReader::open_embedded("archive.eng", 4096)?;

// read_card() transparently seeks within the host file.
let card = reader.read_card(0)?;
println!("card 0: {} bytes", card.len());
```
This enables consumers like Engram to stitch spool data inline during archive compilation, then serve card reads directly from the archive file: no temp extraction, no filesystem overhead.
### Persistent Vector Store

```rust
use dataspool::{DocumentRef, PersistentVectorStore};

// Create persistent store
let mut store = PersistentVectorStore::new("index.db")?;

// Add document with embedding
let doc_ref = DocumentRef {
    file_path: "docs/intro.cml".into(),
    source: "stdlib".into(),
    metadata: None,
    spool_offset: 0,
    spool_length: 512,
};
let embedding = vec![0.1, 0.2, 0.3]; // Example embedding vector
store.add_document_ref(doc_ref, &embedding)?;

// Search by vector similarity
let query_vector = vec![0.1, 0.2, 0.25];
let results = store.search(&query_vector, 10)?;
for result in results {
    println!("{} (score {:.3})", result.file_path, result.score);
}
```
## Spool Format

### File Structure

`.spool` file:

```
┌──────────────────────────────┐
│ Magic: "SP01"      (4 bytes) │
│ Version: 1         (1 byte)  │
│ Card Count         (4 bytes) │
│ Index Offset       (8 bytes) │
├──────────────────────────────┤
│ Card 0 Data                  │
│ Card 1 Data                  │
│ ...                          │
│ Card N Data                  │
├──────────────────────────────┤
│ Index:                       │
│   [offset0, len0]            │
│   [offset1, len1]            │
│   ...                        │
│   [offsetN, lenN]            │
└──────────────────────────────┘
```

`.db` file (SQLite):

```
┌──────────────────────────────┐
│ documents table:             │
│   - id                       │
│   - file_path                │
│   - source                   │
│   - metadata (JSON)          │
│   - spool_offset             │
│   - spool_length             │
├──────────────────────────────┤
│ embeddings table:            │
│   - doc_id                   │
│   - vector (BLOB)            │
└──────────────────────────────┘
```
### Format Details

- **Magic Number**: `SP01` (4 bytes) - Identifies spool format
- **Version**: `1` (1 byte) - Format version
- **Card Count**: Number of entries in spool (`u32`)
- **Index Offset**: Byte offset where index starts (`u64`)
- **Index**: Array of `[offset: u64, length: u32]` pairs (12 bytes each)
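As an illustrative sketch of the fixed-size header above (not the crate's internal code; little-endian integer encoding is an assumption the format notes do not state):

```rust
/// Parsed spool header fields, per the layout above.
/// Sketch only; little-endian encoding is assumed.
struct SpoolHeader {
    version: u8,
    card_count: u32,
    index_offset: u64,
}

fn parse_header(bytes: &[u8]) -> Result<SpoolHeader, String> {
    // Magic (4) + version (1) + card count (4) + index offset (8) = 17 bytes.
    if bytes.len() < 17 {
        return Err("header too short".into());
    }
    if &bytes[0..4] != b"SP01" {
        return Err("bad magic number".into());
    }
    Ok(SpoolHeader {
        version: bytes[4],
        card_count: u32::from_le_bytes(bytes[5..9].try_into().unwrap()),
        index_offset: u64::from_le_bytes(bytes[9..17].try_into().unwrap()),
    })
}
```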
## Architecture

```
┌─────────────┐
│  DataCard   │  (compressed CML)
└──────┬──────┘
       │
       v
┌─────────────┐      ┌──────────────┐
│ SpoolBuilder│─────>│ .spool file  │
└─────────────┘      └──────┬───────┘
                            │
            ┌───────────────┼────────────────┐
            v               v                v
     ┌────────────┐   ┌──────────┐   ┌──────────────┐
     │ Standalone │   │ Embedded │   │ .db (SQLite) │
     │ SpoolReader│   │ in .eng  │   │ - documents  │
     │ ::open()   │   │ ::open_  │   │ - embeddings │
     └────────────┘   │embedded()│   └──────┬───────┘
                      └──────────┘          │
                                            v
                                 ┌──────────────────┐
                                 │ PersistentVector │
                                 │ Store            │
                                 └──────────────────┘
```
### Standalone vs. Embedded

| Mode | Constructor | Use Case |
|---|---|---|
| Standalone | `SpoolReader::open(path)` | Reading `.spool` files directly from disk |
| Embedded | `SpoolReader::open_embedded(path, offset)` | Reading spools stitched into a host file (e.g., Engram archives) |

Both modes share the same `read_card()` / `read_card_at()` API. The embedded constructor adjusts internal byte offsets by the base offset so all seeks target the correct position within the host file.
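The base-offset arithmetic can be sketched as follows; this illustrates the idea only, and the helper name and signature are made up, not the crate's internal implementation:

```rust
use std::io::{Read, Seek, SeekFrom};

/// Read one entry from a spool embedded in a host file. Index offsets are
/// relative to the spool's own start, so the reader adds the base offset
/// (where the spool begins inside the host file) before seeking.
/// Illustrative helper; names and signature are assumptions.
fn read_embedded_entry<R: Read + Seek>(
    host: &mut R,
    base_offset: u64,
    entry_offset: u64,
    entry_len: u32,
) -> std::io::Result<Vec<u8>> {
    host.seek(SeekFrom::Start(base_offset + entry_offset))?;
    let mut buf = vec![0u8; entry_len as usize];
    host.read_exact(&mut buf)?;
    Ok(buf)
}
```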
## Use Cases

### 1. Knowledge Base Archival

Bundle thousands of documentation cards into a single file:

```rust
// Build spool from cards
let mut builder = SpoolBuilder::new();
for card in documentation_cards {
    builder.add_entry(card.to_bytes());
}
builder.write_to_file("kb.spool")?;

// Create vector index
let mut store = PersistentVectorStore::new("kb.db")?;
for (i, embedding) in embeddings.iter().enumerate() {
    store.add_document_ref(doc_refs[i].clone(), embedding)?;
}
```
### 2. Image Dataset Storage

Store image collections with metadata:

```rust
let mut builder = SpoolBuilder::new();
for image_path in image_paths {
    builder.add_entry(std::fs::read(&image_path)?);
}
builder.write_to_file("images.spool")?;
```
### 3. Binary Blob Archival

Archive arbitrary binary data with fast random access:

```rust
// Write blobs
let mut builder = SpoolBuilder::new();
builder.add_entry(blob_a);
builder.add_entry(blob_b);
builder.write_to_file("blobs.spool")?;

// Random access read
let reader = SpoolReader::open("blobs.spool")?;
let blob1_data = reader.read_entry(1)?; // Direct access, no scan
```
## Performance
Benchmark results (3,309 items, Rust stdlib documentation):
| Operation | Time | Notes |
|---|---|---|
| Build spool | ~200ms | Writing all items + index |
| Read single item | <1ms | Direct byte offset seek |
| Read all items | ~50ms | Sequential read |
| SQLite insert (1 doc) | ~0.5ms | With embedding |
| Vector search (10 results) | ~5ms | Cosine similarity + index |
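The vector-search timing above reflects a cosine-similarity comparison over stored embeddings; a minimal sketch of that metric (illustrative, not the crate's internal code):

```rust
/// Cosine similarity between two equal-length embedding vectors:
/// dot(a, b) / (|a| * |b|). Returns 0.0 for zero-norm inputs.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}
```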
### Comparison to Alternatives
| Approach | Read Speed | Storage Overhead | Random Access |
|---|---|---|---|
| Individual files | Slow (3,309 inodes) | High (4KB/file) | Yes |
| tar archive | Slow (must scan) | Low | No |
| zip archive | Fast | Medium | Yes |
| DataSpool | Fast | Minimal | Yes |
### DataSpool Advantages

- **No compression overhead** - Items pre-compressed by BytePunch
- **Instant random access** - Direct byte offset, no central directory scan
- **Integrated vector DB** - Semantic search without external tools
- **Minimal format** - Simple binary format, easy to parse
## Dependencies

```toml
[dependencies]
dataspool = "0.1.0"
bytepunch = "0.1.0" # For compressed item decompression
```
### Dependency Graph

```
dataspool
├── bytepunch (compression)
├── rusqlite (SQLite database)
├── serde (serialization)
└── thiserror (error handling)
```
## Cargo Features

### Default

Basic spool read/write and persistent vector store.

### Optional: `async`

Async APIs for non-blocking I/O:

```toml
[dependencies]
dataspool = { version = "0.1.0", features = ["async"] }
```

```rust
use dataspool::AsyncSpoolReader;

let reader = AsyncSpoolReader::open("cards.spool").await?;
let data = reader.read_entry(0).await?;
```
## Installation

Add to `Cargo.toml`:

```toml
[dependencies]
dataspool = "0.1.0"
```

Or with async support:

```toml
[dependencies]
dataspool = { version = "0.1.0", features = ["async"] }
```
## Testing

```sh
# Run all tests
cargo test

# Run with logging
RUST_LOG=debug cargo test

# Test a specific module
cargo test <module_name>
```
## Examples

See the `examples/` directory:

- `build_spool.rs` - Build a spool from files
- `read_spool.rs` - Read entries from a spool
- `vector_search.rs` - Semantic search with embeddings

Run with:

```sh
cargo run --example build_spool
```
## Roadmap
- Image-based spools with EXIF metadata
- Audio/video spool variants
- Compression statistics per entry
- Incremental spool updates (append-only mode)
- Multi-threaded indexing
- Memory-mapped I/O for large spools
- Network streaming protocol
## History
Extracted from the SAM (Societal Advisory Module) project, where it provides the spool bundling system for knowledge base archival.
## License
MIT - See LICENSE for details.
## Author
Magnus Trent magnus@blackfall.dev
## Links
- GitHub: https://github.com/Blackfall-Labs/dataspool-rs
- Docs: https://docs.rs/dataspool
- Crates.io: https://crates.io/crates/dataspool
- SAM Project: https://github.com/Blackfall-Labs/sam