# DataSpool - Efficient Data Bundling System
DataSpool is a high-performance data bundling library that eliminates filesystem overhead by concatenating multiple items (cards, images, binary blobs) into a single indexed .spool file with SQLite-based metadata and vector embeddings.
## Features
- **Efficient Bundling** - Single-file storage with a byte-offset index
- **Random Access** - Direct seeks to any item without scanning
- **Vector Search** - SQLite-backed embeddings for semantic retrieval
- **Metadata Storage** - Rich metadata with full-text search (FTS5)
- **Multiple Variants** - Cards (compressed CML), images, binary blobs
- **Compact Format** - Minimal overhead, optimal for thousands of items
- **Type-Safe** - Rust type safety with serde serialization
## Quick Start

### Writing a Spool
```rust
use dataspool::SpoolBuilder;

// Create spool builder
let mut builder = SpoolBuilder::new();

// Add entries (raw bytes per item)
builder.add_entry(b"first card".to_vec());
builder.add_entry(b"second card".to_vec());

// Write to file
builder.write_to_file("cards.spool")?;
```
### Reading from a Spool
```rust
use dataspool::SpoolReader;

// Open spool
let reader = SpoolReader::open("cards.spool")?;

// Read specific entry by index
let data = reader.read_entry(0)?; // Read first entry
println!("entry 0: {} bytes", data.len());

// Iterate entries
for (i, entry) in reader.iter_entries().enumerate() {
    println!("entry {i}: {} bytes", entry.len());
}
```
## Persistent Vector Store
```rust
use dataspool::{DocumentRef, PersistentVectorStore};

// Create persistent store (SQLite-backed)
let mut store = PersistentVectorStore::new("index.db")?;

// Add document with embedding
let doc_ref = DocumentRef {
    id: "doc-0".into(),
    file_path: "cards.spool".into(),
    source: "docs".into(),
    metadata: None,
    spool_offset: 0,
    spool_length: 1024,
};
let embedding = vec![0.1, 0.2, 0.3]; // Example embedding vector
store.add_document_ref(doc_ref, &embedding)?;

// Search by vector similarity
let query_vector = vec![0.1, 0.25, 0.3];
let results = store.search(&query_vector, 10)?;
for result in results {
    println!("{}: {:.3}", result.doc_id, result.score);
}
```
## Spool Format

### File Structure
```text
.spool file:
┌──────────────────────────────┐
│ Magic: "SP01"       (4 bytes)│
│ Version: 1           (1 byte)│
│ Card Count          (4 bytes)│
│ Index Offset        (8 bytes)│
├──────────────────────────────┤
│ Card 0 Data                  │
│ Card 1 Data                  │
│ ...                          │
│ Card N Data                  │
├──────────────────────────────┤
│ Index:                       │
│   [offset0, len0]            │
│   [offset1, len1]            │
│   ...                        │
│   [offsetN, lenN]            │
└──────────────────────────────┘
```
```text
.db file (SQLite):
┌──────────────────────────────┐
│ documents table:             │
│   - id                       │
│   - file_path                │
│   - source                   │
│   - metadata (JSON)          │
│   - spool_offset             │
│   - spool_length             │
├──────────────────────────────┤
│ embeddings table:            │
│   - doc_id                   │
│   - vector (BLOB)            │
└──────────────────────────────┘
```
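The two tables above could be declared with DDL along these lines. This is a sketch only: column types and constraints are assumptions based on the diagram, not the crate's actual schema.

```sql
-- Hypothetical DDL matching the diagram above
CREATE TABLE documents (
    id           TEXT PRIMARY KEY,
    file_path    TEXT NOT NULL,
    source       TEXT,
    metadata     TEXT,     -- JSON
    spool_offset INTEGER,
    spool_length INTEGER
);

CREATE TABLE embeddings (
    doc_id TEXT REFERENCES documents(id),
    vector BLOB              -- packed float values
);
```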
### Format Details

- **Magic Number**: `SP01` (4 bytes) - identifies the spool format
- **Version**: `1` (1 byte) - format version
- **Card Count**: number of entries in the spool (`u32`)
- **Index Offset**: byte offset where the index starts (`u64`)
- **Index**: array of `[offset: u64, length: u64]` pairs
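The 17-byte header can be encoded and decoded with a few lines of std-only Rust. This is an illustrative sketch, not the crate's API; little-endian byte order is an assumption, since the format description above does not state one.

```rust
// Sketch of the header layout described above (names are illustrative).
fn encode_header(card_count: u32, index_offset: u64) -> Vec<u8> {
    let mut buf = Vec::with_capacity(17);
    buf.extend_from_slice(b"SP01");                     // magic (4 bytes)
    buf.push(1);                                        // version (1 byte)
    buf.extend_from_slice(&card_count.to_le_bytes());   // card count (u32)
    buf.extend_from_slice(&index_offset.to_le_bytes()); // index offset (u64)
    buf
}

fn decode_header(buf: &[u8]) -> Option<(u32, u64)> {
    // Reject short buffers, wrong magic, or unknown version
    if buf.len() < 17 || &buf[0..4] != b"SP01".as_slice() || buf[4] != 1 {
        return None;
    }
    let card_count = u32::from_le_bytes(buf[5..9].try_into().ok()?);
    let index_offset = u64::from_le_bytes(buf[9..17].try_into().ok()?);
    Some((card_count, index_offset))
}

fn main() {
    let header = encode_header(3309, 4096);
    assert_eq!(header.len(), 17);
    assert_eq!(decode_header(&header), Some((3309, 4096)));
    println!("header round-trip ok");
}
```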
## Architecture
```text
┌──────────────┐
│   DataCard   │ (compressed CML)
└──────┬───────┘
       │
       v
┌──────────────┐      ┌──────────────┐
│ SpoolBuilder │─────>│ .spool file  │
└──────────────┘      └──────┬───────┘
                             │
                             v
                      ┌──────────────┐
                      │ SpoolReader  │
                      └──────┬───────┘
                             │
          ┌──────────────────┴──────────────────┐
          v                                     v
┌────────────────────┐               ┌────────────────┐
│ PersistentVector   │               │  .db (SQLite)  │
│ Store              │<──────────────│  - documents   │
└────────────────────┘               │  - embeddings  │
                                     └────────────────┘
```
## Use Cases

### 1. Knowledge Base Archival
Bundle thousands of documentation cards into a single file:
```rust
// Build spool from cards
let mut builder = SpoolBuilder::new();
for card in documentation_cards {
    builder.add_entry(card.to_bytes());
}
builder.write_to_file("docs.spool")?;

// Create vector index
let mut store = PersistentVectorStore::new("docs.db")?;
for (i, embedding) in embeddings.iter().enumerate() {
    store.add_document_ref(doc_refs[i].clone(), embedding)?;
}
```
### 2. Image Dataset Storage
Store image collections with metadata:
```rust
let mut builder = SpoolBuilder::new();
for image_path in image_paths {
    builder.add_entry(std::fs::read(&image_path)?);
}
builder.write_to_file("images.spool")?;
```
### 3. Binary Blob Archival
Archive arbitrary binary data with fast random access:
```rust
// Write blobs
let mut builder = SpoolBuilder::new();
builder.add_entry(blob_0);
builder.add_entry(blob_1);
builder.write_to_file("blobs.spool")?;

// Random access read
let reader = SpoolReader::open("blobs.spool")?;
let blob1_data = reader.read_entry(1)?; // Direct access, no scan
```
## Performance
Benchmark results (3,309 items, Rust stdlib documentation):
| Operation | Time | Notes |
|---|---|---|
| Build spool | ~200ms | Writing all items + index |
| Read single item | <1ms | Direct byte offset seek |
| Read all items | ~50ms | Sequential read |
| SQLite insert (1 doc) | ~0.5ms | With embedding |
| Vector search (10 results) | ~5ms | Cosine similarity + index |
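For reference, the cosine-similarity scoring mentioned in the vector-search row can be sketched as a generic function. This is not the crate's internal code, just the standard metric:

```rust
/// Cosine similarity between two equal-length vectors:
/// dot(a, b) / (|a| * |b|), ranging over [-1.0, 1.0].
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0 // define similarity with a zero vector as 0
    } else {
        dot / (norm_a * norm_b)
    }
}

fn main() {
    // Same direction -> similarity 1.0
    assert!((cosine_similarity(&[1.0, 2.0], &[2.0, 4.0]) - 1.0).abs() < 1e-6);
    // Orthogonal -> similarity 0.0
    assert!(cosine_similarity(&[1.0, 0.0], &[0.0, 1.0]).abs() < 1e-6);
    println!("cosine checks ok");
}
```

A search then scores the query vector against each stored embedding and keeps the top-k results.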
### Comparison to Alternatives
| Approach | Read Speed | Storage Overhead | Random Access |
|---|---|---|---|
| Individual files | Slow (3,309 inodes) | High (4KB/file) | Yes |
| tar archive | Slow (must scan) | Low | No |
| zip archive | Fast | Medium | Yes |
| DataSpool | Fast | Minimal | Yes |
### DataSpool Advantages

- **No compression overhead** - Items are pre-compressed by BytePunch
- **Instant random access** - Direct byte offset, no central directory scan
- **Integrated vector DB** - Semantic search without external tools
- **Minimal format** - Simple binary format, easy to parse
## Dependencies

```toml
[dependencies]
dataspool = "0.1.0"
bytepunch = "0.1.0" # For compressed item decompression
```
### Dependency Graph

```text
dataspool
├── bytepunch (compression)
├── rusqlite (SQLite database)
├── serde (serialization)
└── thiserror (error handling)
```
## Feature Flags

### Default
Basic spool read/write and persistent vector store.
### Optional: `async`
Async APIs for non-blocking I/O:
```toml
[dependencies]
dataspool = { version = "0.1.0", features = ["async"] }
```

```rust
use dataspool::AsyncSpoolReader;

let reader = AsyncSpoolReader::open("cards.spool").await?;
let data = reader.read_entry(0).await?;
```
## Installation

Add to `Cargo.toml`:

```toml
[dependencies]
dataspool = "0.1.0"
```

Or with async support:

```toml
[dependencies]
dataspool = { version = "0.1.0", features = ["async"] }
```
## Testing

```sh
# Run all tests
cargo test

# Run with logging
RUST_LOG=debug cargo test

# Test specific module
cargo test <module_name>
```
## Examples

See the `examples/` directory:

- `build_spool.rs` - Build a spool from files
- `read_spool.rs` - Read entries from a spool
- `vector_search.rs` - Semantic search with embeddings

Run with:

```sh
cargo run --example build_spool
```
## Roadmap
- Image-based spools with EXIF metadata
- Audio/video spool variants
- Compression statistics per entry
- Incremental spool updates (append-only mode)
- Multi-threaded indexing
- Memory-mapped I/O for large spools
- Network streaming protocol
## History
Extracted from the SAM (Societal Advisory Module) project, where it provides the spool bundling system for knowledge base archival.
## License
MIT - See LICENSE for details.
## Author

Magnus Trent <magnus@blackfall.dev>
## Links
- GitHub: https://github.com/Blackfall-Labs/dataspool-rs
- Docs: https://docs.rs/dataspool
- Crates.io: https://crates.io/crates/dataspool
- SAM Project: https://github.com/Blackfall-Labs/sam