Repartir: Sovereign AI-Grade Distributed Computing
Repartir is a pure Rust library for distributed execution across CPUs, GPUs, and remote machines. It is built on the Iron Lotus Framework (Toyota Way principles for systems programming) and validated by the certeza testing methodology.
Table of Contents
- Features
- Installation
- Quick Start
- Feature Flags
- Architecture
- Iron Lotus Framework
- Testing (Certeza Methodology)
- Sovereign AI Principles
- Roadmap
- Contributing
- Documentation
- Academic Foundations
- Comparison with Existing Systems
- License
- Acknowledgments
Features
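- ✅ 100% Rust, Zero C/C++: True digital sovereignty through complete auditability (exception: the optional TLS support uses rustls, which currently depends on aws-lc-rs, a C library; see the note under Comparison with Existing Systems)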
- ✅ Memory Safety Guaranteed: Safe Rust's compile-time guarantees, with the soundness of the core language formally proven by RustBelt
- ✅ Work-Stealing Scheduler: Based on Blumofe & Leiserson (1999)
- ✅ Priority-Based Execution: High, Normal, and Low priority queues
- ✅ Fault Tolerance: Task retry, timeout handling, graceful failure
- ✅ Supply Chain Security: Dependency pinning, binary signing, license enforcement
- ✅ Iron Lotus Quality: ≥95% coverage target, ≥80% mutation score, formal verification
- ✅ Certeza Testing: Three-tiered testing (sub-second → minutes → hours)
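To make the priority-based execution item concrete, here is a minimal sketch of three FIFO queues drained in strict High, Normal, Low order. The names `Priority`, `PriorityScheduler`, `submit`, and `next_task` are hypothetical and chosen for illustration; they are not taken from Repartir's API, and the real scheduler may order or balance work differently.

```rust
use std::collections::VecDeque;

/// Illustrative priority levels (hypothetical names, not Repartir's types).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Priority {
    High,
    Normal,
    Low,
}

/// One FIFO queue per priority level.
struct PriorityScheduler<T> {
    high: VecDeque<T>,
    normal: VecDeque<T>,
    low: VecDeque<T>,
}

impl<T> PriorityScheduler<T> {
    fn new() -> Self {
        PriorityScheduler {
            high: VecDeque::new(),
            normal: VecDeque::new(),
            low: VecDeque::new(),
        }
    }

    /// Append a task to the queue that matches its priority.
    fn submit(&mut self, priority: Priority, task: T) {
        match priority {
            Priority::High => self.high.push_back(task),
            Priority::Normal => self.normal.push_back(task),
            Priority::Low => self.low.push_back(task),
        }
    }

    /// Take the oldest task from the highest non-empty queue.
    fn next_task(&mut self) -> Option<T> {
        self.high
            .pop_front()
            .or_else(|| self.normal.pop_front())
            .or_else(|| self.low.pop_front())
    }
}
```

Submitting "cleanup" (Low), "index" (Normal), "alert" (High), and "report" (Normal) in that order and then draining yields "alert", "index", "report", "cleanup". Strict ordering like this can starve low-priority work under sustained load; the sketch does not attempt to address that.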
Installation
# From crates.io
# From source
Quick Start
Add to your Cargo.toml:
[dependencies]
repartir = "0.1"
tokio = { version = "1.35", features = ["rt-multi-thread", "macros"] }
Basic Example
use ;
async
Run the Example
Comprehensive v1.1 Showcase
See all v1.1 features in action:
# Generate TLS certificates first
# Run comprehensive showcase
Demonstrates:
- ✅ CPU executor with work-stealing (48 workers, Blumofe & Leiserson algorithm)
- ✅ GPU detection (NVIDIA RTX 4090, 2048 compute units)
- ✅ TLS encryption (certificate-based auth, TLS 1.3)
- ✅ Priority scheduling (High/Normal/Low queues)
- ✅ Parallel speedup (3.82x with 4 workers)
- ✅ Fault tolerance (graceful error handling)
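As a rough illustration of graceful error handling with task retry, the sketch below reruns a fallible closure up to a fixed number of attempts and returns either the first success or the last error. `run_with_retry` is a hypothetical helper written for this README, not part of Repartir's API, and it leaves out timeouts and backoff.

```rust
/// Run `task` up to `max_attempts` times.
/// The closure receives the 1-based attempt number.
/// Returns the first Ok value, or the error from the final attempt.
fn run_with_retry<T, E>(
    max_attempts: u32,
    mut task: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    assert!(max_attempts > 0, "max_attempts must be at least 1");
    let mut attempt = 1;
    loop {
        match task(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt >= max_attempts => return Err(err),
            // Otherwise try again.
            Err(_) => attempt += 1,
        }
    }
}
```

A task that fails on attempts 1 and 2 and succeeds on attempt 3 returns `Ok` when `max_attempts` is 3, and returns the second attempt's error when `max_attempts` is 2.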
v2.0 Data Integration Features
Repartir v2.0 introduces Parquet checkpoint storage and data-locality aware scheduling for enterprise-grade distributed computing.
Checkpoint with Parquet Storage (v2.0)
Enable persistent checkpointing with Apache Parquet format (5-10x compression vs JSON):
use ;
use TaskState;
async
Storage efficiency:
- SNAPPY compression: 5-10x smaller checkpoint files
- Columnar format: Optimized for analytical queries
- Backward compatible: Reads both .parquet and .json formats
Run the example:
Locality-Aware Scheduling (v2.0)
Minimize network transfers by scheduling tasks on workers that already have required data:
use ;
async
Scheduling intelligence:
- Affinity scoring: (data_items_present / total_data_items)
- Automatic optimization: Prefers workers with matching data
- Real-time metrics: Track locality hit rate (0.0 to 1.0)
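The affinity formula above can be sketched in a few lines. This is an illustration under stated assumptions, not Repartir's scheduler: `affinity_score` and `pick_worker` are hypothetical names, workers are modeled as a map from worker id to the set of data keys they hold, an empty requirement scores 0.0, and ties go to the lexicographically smallest worker id.

```rust
use std::collections::{HashMap, HashSet};

/// Fraction of the task's required data items already present on a worker:
/// data_items_present / total_data_items.
/// Scoring an empty requirement as 0.0 is an assumption of this sketch.
fn affinity_score(present: &HashSet<&str>, required: &[&str]) -> f64 {
    if required.is_empty() {
        return 0.0;
    }
    let hits = required
        .iter()
        .filter(|&&item| present.contains(item))
        .count();
    hits as f64 / required.len() as f64
}

/// Pick the worker with the highest affinity score.
/// Worker ids are visited in sorted order and only a strictly higher
/// score replaces the current best, so ties resolve deterministically.
fn pick_worker<'a>(
    workers: &HashMap<&'a str, HashSet<&str>>,
    required: &[&str],
) -> Option<&'a str> {
    let mut ids: Vec<&'a str> = workers.keys().copied().collect();
    ids.sort();
    let mut best: Option<(&'a str, f64)> = None;
    for id in ids {
        let score = affinity_score(&workers[id], required);
        if best.map_or(true, |(_, best_score)| score > best_score) {
            best = Some((id, score));
        }
    }
    best.map(|(id, _)| id)
}
```

For example, a worker holding items "a" and "b" scores 0.5 against a task that needs "a", "b", "c", and "d", and is preferred over a worker holding only "a" for a task that needs "a" and "b".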
Performance benefits:
- Reduces network I/O for data-intensive workloads
- Improves cache utilization
- Enables efficient batch processing
Tensor Operations with SIMD (v2.0)
High-performance tensor operations with automatic SIMD optimization:
use ;
use Backend;
async
SIMD acceleration:
- Leverages trueno library's AVX2/AVX-512 optimizations
- 2-8x speedup vs scalar operations on modern CPUs
- Automatic backend selection based on CPU features
- f32 precision for optimal SIMD register utilization
Operations:
- Element-wise: add(), sub(), mul(), div()
- Dot product: dot()
- Scalar: scalar_mul()
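For reference, the semantics of three of these operations can be written as plain scalar Rust over f32 slices. This is only a sketch of what the operations compute: it does not use the trueno library or Repartir's tensor type, the free-function signatures are assumptions, and returning `None` on a length mismatch is a choice made here for illustration.

```rust
/// Element-wise addition of two equal-length slices.
/// Returns None when the lengths differ.
fn add(a: &[f32], b: &[f32]) -> Option<Vec<f32>> {
    if a.len() != b.len() {
        return None;
    }
    Some(a.iter().zip(b).map(|(x, y)| x + y).collect())
}

/// Dot product: the sum of pairwise products.
/// Returns None when the lengths differ.
fn dot(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    Some(a.iter().zip(b).map(|(x, y)| x * y).sum())
}

/// Multiply every element by the scalar `k`.
fn scalar_mul(a: &[f32], k: f32) -> Vec<f32> {
    a.iter().map(|x| x * k).collect()
}
```

A SIMD backend computes the same results several lanes at a time; a scalar version like this one is useful mainly as a baseline for correctness checks.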
Run the example:
Feature Flags
Repartir supports multiple execution backends via feature flags:
[dependencies]
# CPU only (default)
repartir = "0.1"
# With GPU support (v1.1+)
repartir = { version = "0.1", features = ["gpu"] }
# With remote execution (v1.1+)
repartir = { version = "0.1", features = ["remote"] }
# With TLS encryption (v1.1+)
repartir = { version = "0.1", features = ["remote-tls"] }
# With Parquet checkpointing (v2.0+)
repartir = { version = "0.1", features = ["checkpoint"] }
# With SIMD tensor operations (v2.0+)
repartir = { version = "0.1", features = ["tensor"] }
# All features
repartir = { version = "0.1", features = ["full"] }
GPU Executor (v1.1+)
The GPU executor uses wgpu for cross-platform GPU compute:
use GpuExecutor;
use Executor;
async
Supported backends:
- Vulkan (Linux/Windows/Android)
- Metal (macOS/iOS)
- DirectX 12 (Windows)
- WebGPU (browsers)
Note (v1.1): GPU detection and initialization only. Binary task execution on GPU requires compute shader compilation (v1.2+ with rust-gpu).
TLS Encryption (v1.1+)
Secure remote execution with TLS encryption using rustls:
use TlsConfig;
async
Security features:
- TLS 1.3 end-to-end encryption
- Certificate-based authentication
- Perfect forward secrecy
- MITM attack protection
Generate test certificates:
⚠️ WARNING: The included certificate generator creates self-signed certificates for TESTING ONLY. For production, use certificates from a trusted CA (Let's Encrypt, DigiCert, etc.).
Messaging Patterns (v1.1+)
Advanced messaging for distributed coordination with PUB/SUB and PUSH/PULL patterns:
Publish-Subscribe (PUB/SUB)
One publisher broadcasts to multiple subscribers:
use ;
async
Use cases: Event notifications, logging, monitoring, real-time updates
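The broadcast behavior of PUB/SUB can be illustrated in-process with standard-library channels. This is a toy model of the pattern, not Repartir's messaging API: the `Publisher` type and its `subscribe` and `publish` methods are hypothetical, and a real implementation would add topics and network transport.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// A toy in-process publisher: each subscriber gets its own channel,
/// and every published message is cloned to all of them.
struct Publisher<T: Clone> {
    subscribers: Vec<Sender<T>>,
}

impl<T: Clone> Publisher<T> {
    fn new() -> Self {
        Publisher { subscribers: Vec::new() }
    }

    /// Register a new subscriber and hand back its receiving end.
    fn subscribe(&mut self) -> Receiver<T> {
        let (tx, rx) = channel();
        self.subscribers.push(tx);
        rx
    }

    /// Send a copy of `msg` to every live subscriber.
    /// Subscribers whose receiver was dropped are removed.
    /// Returns the number of subscribers that received the message.
    fn publish(&mut self, msg: T) -> usize {
        self.subscribers.retain(|tx| tx.send(msg.clone()).is_ok());
        self.subscribers.len()
    }
}
```

With two subscribers registered, one `publish` call delivers the same message to both; once a subscriber drops its receiver, later messages reach only the remaining one.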
Push-Pull (PUSH/PULL)
Work distribution with automatic load balancing:
use ;
async
Use cases: Work queues, job scheduling, pipeline processing, task distribution
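PUSH/PULL differs from PUB/SUB in that each message goes to exactly one worker. The sketch below models that with a single shared queue and several puller threads, using only the standard library. `push_pull` is a hypothetical function for illustration and does not reflect Repartir's API or its load-balancing policy.

```rust
use std::sync::mpsc::channel;
use std::sync::{Arc, Mutex};
use std::thread;

/// Push `jobs` onto one shared queue and let `workers` threads pull from it.
/// Returns how many jobs each worker processed.
fn push_pull(jobs: Vec<u64>, workers: usize) -> Vec<usize> {
    let (tx, rx) = channel::<u64>();
    // std's Receiver is single-consumer, so pullers share it behind a mutex.
    let rx = Arc::new(Mutex::new(rx));

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || {
                let mut done = 0usize;
                loop {
                    // The lock is held only while taking the next job.
                    let job = rx.lock().unwrap().recv();
                    match job {
                        Ok(_payload) => done += 1, // "process" the job
                        Err(_) => break,           // sender dropped, queue drained
                    }
                }
                done
            })
        })
        .collect();

    for job in jobs {
        tx.send(job).unwrap();
    }
    drop(tx); // closing the sending side lets the pullers exit

    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Every job is handled exactly once, so the per-worker counts always sum to the number of jobs pushed. How the jobs split between workers depends on thread timing: whichever worker is free takes the next one.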
Run examples:
Architecture
Repartir follows a clean, layered architecture:
┌─────────────────────────────────────┐
│ Pool (High-Level API) │
├─────────────────────────────────────┤
│ Scheduler (Priority Queue + Work │
│ Stealing - Blumofe & Leiserson) │
├─────────────────────────────────────┤
│ Executor Backends │
│ ┌────────┐ ┌────────┐ ┌────────┐│
│ │ CPU │ │ GPU │ │ Remote ││
│ │(v1.0) │ │(v1.1+) │ │(v1.1+) ││
│ └────────┘ └────────┘ └────────┘│
└─────────────────────────────────────┘
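The work-stealing discipline in the scheduler layer can be shown with a single-threaded simulation: a worker takes the newest task from the back of its own deque and, when that is empty, steals the oldest task from the front of another worker's deque. `Worker` and `next_task` are hypothetical names for this sketch. Production schedulers typically use concurrent deques and randomized victim selection, neither of which is modeled here.

```rust
use std::collections::VecDeque;

/// A worker with its own double-ended task queue (tasks are plain ids here).
struct Worker {
    queue: VecDeque<u32>,
}

/// Get the next task for worker `me`.
/// 1. Prefer the newest task from the worker's own queue (back).
/// 2. Otherwise steal the oldest task (front) from the first
///    other worker that has any work.
fn next_task(workers: &mut [Worker], me: usize) -> Option<u32> {
    if let Some(task) = workers[me].queue.pop_back() {
        return Some(task);
    }
    for victim in 0..workers.len() {
        if victim != me {
            if let Some(task) = workers[victim].queue.pop_front() {
                return Some(task);
            }
        }
    }
    None
}
```

If worker 0 holds tasks 1, 2, 3 and worker 1 is idle, worker 1 steals task 1 (the oldest), while worker 0 continues with task 3 and then task 2. Taking from opposite ends keeps the owner and the thief out of each other's way.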
Iron Lotus Framework
Repartir embodies Toyota Production System principles:
Genchi Genbutsu (現地現物 - "Go and See")
- Radical Transparency: Every operation traceable from API → scheduler → executor
- No Black Boxes: 100% pure Rust, zero opaque C/C++ libraries
- AST-Level Inspection: Code structure visible via pmat
Jidoka (自働化 - "Automation with Human Touch")
- Automated Quality Gates: CI enforces clippy, rustfmt, tests, coverage
- Andon Cord: Build fails immediately on any defect
- No Manual Checks: Machines verify before humans review
Kaizen (改善 - "Continuous Improvement")
- Technical Debt Grading: TDG score must never decrease
- Ratchet Effect: Each PR improves or maintains quality
- Five Whys: Root cause analysis for all incidents
Muda (無駄 - "Waste Elimination")
- No Overproduction: No speculative features (YAGNI)
- No Waiting: Fast compilation with sccache
- No Transportation: Zero-copy data flow, single language
- No Defects: EXTREME TDD with mutation testing
Testing (Certeza Methodology)
Repartir uses a three-tiered testing approach:
Tier 1: ON-SAVE (Sub-Second)
Fast feedback for flow state:
- Unit tests (21 tests)
- cargo check
- cargo clippy
- cargo fmt
Target: < 3 seconds
Tier 2: ON-COMMIT (1-5 Minutes)
Comprehensive pre-commit gate:
- All tests (21 unit + 4 property + 4 doc = 29 tests)
- Property-based tests (proptest)
- Coverage analysis (target ≥95%)
- Documentation tests
- Security audit (cargo-audit, cargo-deny)
Target: 1-5 minutes
Tier 3: ON-MERGE (Hours)
Exhaustive validation:
- Mutation testing (cargo-mutants, target ≥80%)
- Formal verification (Kani, for critical paths)
- Extended fuzzing
- Performance benchmarks
Target: 1-6 hours (run overnight or in CI)
Test Results (v1.0)
✓ 21 unit tests (0.10s)
✓ 4 property-based tests (1.92s)
✓ 4 documentation tests (0.23s)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
29 tests PASSED
Sovereign AI Principles
Digital Sovereignty Requirements
- Auditability: Trace execution from user API → hardware instruction
- Supply Chain Independence: Deterministic rebuild from source
- No Foreign Dependencies: Zero opaque binary blobs
- Memory Safety Guarantees: Provable absence of vulnerabilities
Supply Chain Security
- Dependency Pinning: Cargo.lock committed and reviewed
- License Enforcement: Only MIT/Apache-2.0/BSD allowed (via cargo-deny)
- Binary Signing: ed25519 signatures for distributed binaries
- Audit Trail: cargo tree logged in CI
Memory Safety (NSA/CISA Guidance)
Per NSA/CISA joint guidance on memory-safe languages:
- ✅ Rust provides compile-time memory safety guarantees
- ✅ RustBelt formal verification proves the soundness of Rust's core type system
- ✅ #![deny(unsafe_code)] in v1.0 (no unsafe code)
- ✅ Eliminates buffer overflows, use-after-free, data races
Roadmap
v1.0: Sovereign Foundation (Current)
- ✅ CPU executor with work-stealing scheduler
- ✅ Priority-based task scheduling
- ✅ High-level Pool API
- ✅ Comprehensive testing (29 tests)
- ✅ Iron Lotus quality gates
- ✅ Supply chain security
v1.1: Production Hardening (Complete)
- ✅ GPU executor skeleton (wgpu detection, v1.2 for rust-gpu compute)
- ✅ Remote executor (TCP transport, length-prefixed bincode protocol)
- ✅ TLS encryption (rustls, certificate-based auth, TLS 1.3)
- ✅ Performance benchmarks vs Ray/Dask (5 benchmark suites)
- ✅ Mutation testing ≥85% (framework + documentation)
- ✅ Comprehensive Makefile (tier1/tier2/tier3, coverage enforcement)
- ✅ bashrs purification (POSIX-compliant shell code)
- ✅ Advanced messaging patterns (PUB/SUB, PUSH/PULL)
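The length-prefixed framing used by the remote executor's TCP transport follows a common pattern, sketched here with the standard library only. The details are assumptions: a 4-byte big-endian length prefix is used for illustration, the payload is treated as opaque bytes (the bincode serialization step is omitted), and `write_frame` and `read_frame` are hypothetical names. The actual wire format may differ.

```rust
use std::convert::TryFrom;
use std::io::{self, Read, Write};

/// Write one frame: a 4-byte big-endian length followed by the payload.
fn write_frame<W: Write>(writer: &mut W, payload: &[u8]) -> io::Result<()> {
    let len = u32::try_from(payload.len())
        .map_err(|_| io::Error::new(io::ErrorKind::InvalidInput, "payload too large"))?;
    writer.write_all(&len.to_be_bytes())?;
    writer.write_all(payload)
}

/// Read one frame: first the 4-byte length, then exactly that many bytes.
/// Fails with UnexpectedEof if the stream ends mid-frame.
fn read_frame<R: Read>(reader: &mut R) -> io::Result<Vec<u8>> {
    let mut len_buf = [0u8; 4];
    reader.read_exact(&mut len_buf)?;
    let len = u32::from_be_bytes(len_buf) as usize;
    let mut payload = vec![0u8; len];
    reader.read_exact(&mut payload)?;
    Ok(payload)
}
```

Because both functions are generic over `Read` and `Write`, the same code works against a `TcpStream` or an in-memory buffer, which makes the framing easy to test without a network. A production reader would also cap the accepted length so that a malformed prefix cannot trigger a huge allocation.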
v2.0: Data Integration (In Progress)
- ✅ Parquet Checkpoint Storage (Phase 1): SNAPPY compression, 5-10x size reduction
- ✅ Data-Locality Tracking (Phase 1): DataLocationTracker with batch affinity queries
- ✅ Affinity-Based Scheduling (Phase 2): Automatic locality-aware task assignment
- ✅ Locality Metrics (Phase 2): Real-time hit rate tracking (0.0 to 1.0)
- ✅ Tensor Operations (Phase 3): SIMD-accelerated operations via trueno (2-8x speedup)
- ✅ Performance Benchmarks (Phase 2): Locality scheduling validated (<1ms overhead)
- Advanced ML patterns (pipeline/tensor parallelism) - Phase 4
v3.0: Enterprise & Cloud
- RDMA support (low-latency networking)
- Multi-tenant isolation
- Kubernetes operator
- FIPS 140-2 compliance mode
Contributing
Contributions welcome! Please ensure:
- All tests pass: make tier2
- Coverage ≥95%: make coverage (when configured)
- Clippy passes: cargo clippy -- -D warnings
- Code formatted: cargo fmt
- No SATD comments (TODO without ticket number)
See Iron Lotus Code Review Framework for detailed guidelines.
Documentation
- Full Specification (1,500 lines, 10+ academic citations)
- API Documentation (run cargo doc --open)
- Examples - Hands-on demonstrations
Academic Foundations
Repartir is grounded in peer-reviewed research:
- Jung et al. (2017) - RustBelt: Formal verification of Rust's safety
- Blumofe & Leiserson (1999) - Provably optimal work-stealing
- Chandra & Toueg (1996) - Unreliable failure detectors
- NSA/CISA (2023) - Memory-safe languages guidance
- Pereira et al. (2017) - Energy efficiency of Rust vs C++
See specification for complete citations.
Comparison with Existing Systems
| Feature | Repartir (v1.1) | Ray | Dask |
|---|---|---|---|
| Language | Rust | Python | Python |
| C Dependencies | Zero* | Many | Some |
| GPU Support | Yes (wgpu) | Limited | No |
| Work Stealing | Yes | No | Yes |
| Fault Tolerance | Yes | Yes | Limited |
| Memory Safety | Guaranteed | Runtime | Runtime |
| Binary Execution | Yes | No | No |
| Remote Execution | Yes (TCP) | Yes | Yes |
*Note: rustls (used for TLS) currently depends on aws-lc-rs (C). Pure Rust alternatives under evaluation for v1.2+.
License
MIT License - see LICENSE file for details.
Acknowledgments
- Iron Lotus Framework: Toyota Production System for systems programming
- Certeza Project: Asymptotic test effectiveness methodology
- PAIML Stack: trueno, aprender, paiml-mcp-agent-toolkit, bashrs
- Rust Community: rust-gpu, wgpu, tokio, and the broader ecosystem
Built with the Iron Lotus Framework
Quality is not inspected in; it is built in.