Video world models break when the camera moves too far, revisits old areas, or tries to maintain scene structure over long rollouts. MosaicMem fixes this with explicit spatial memory -- a geometry-aware memory stack that lifts observed patches into 3D, retrieves them at novel viewpoints, and injects them back into the generation loop.
This Rust implementation (mosaicmem) provides the complete memory-side pipeline: streaming 3D reconstruction, patch-level spatial storage, view-conditioned retrieval, geometric alignment via Warped RoPE / Warped Latent, and autoregressive generation plumbing. Ships with deterministic synthetic backends so the full pipeline runs end-to-end without external model weights.
## How it works
```
Keyframe ──> Depth ──> Lift to 3D ──> Memory Store
                                            |
Target Pose ──> Query Memory ──> Retrieve + Align ──> Condition Diffusion ──> Frame
                                  |           |
                            Warped RoPE  Warped Latent
```
- Depth estimation -- extract per-pixel depth from keyframes
- 3D lifting -- unproject patches into a shared world-space point cloud
- Spatial storage -- index patches in a kd-tree for fast nearest-neighbor lookup
- View-conditioned retrieval -- given a target camera pose, find the most relevant stored patches
- Geometric alignment -- apply Warped RoPE (attention-level) and Warped Latent (feature-level) to align retrieved context with the target view
- Conditioned generation -- inject aligned memory into a diffusion denoising loop to produce geometry-consistent frames
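The lifting step above is an ordinary pinhole unprojection followed by a rigid camera-to-world transform. A minimal, self-contained sketch (the types and function here are illustrative, not mosaicmem's actual API):

```rust
// Pinhole unprojection sketch -- illustrative types, not mosaicmem's API.

#[derive(Clone, Copy)]
struct Intrinsics { fx: f64, fy: f64, cx: f64, cy: f64 }

/// Camera-to-world pose as a 3x3 rotation matrix plus translation.
#[derive(Clone, Copy)]
struct Pose { r: [[f64; 3]; 3], t: [f64; 3] }

/// Lift pixel (u, v) at depth d into a world-space point.
fn unproject(k: Intrinsics, pose: Pose, u: f64, v: f64, d: f64) -> [f64; 3] {
    // Pixel -> camera-space ray, scaled by depth.
    let cam = [(u - k.cx) / k.fx * d, (v - k.cy) / k.fy * d, d];
    // Camera -> world: R * p + t.
    let mut world = pose.t;
    for i in 0..3 {
        for j in 0..3 {
            world[i] += pose.r[i][j] * cam[j];
        }
    }
    world
}

fn main() {
    let k = Intrinsics { fx: 100.0, fy: 100.0, cx: 32.0, cy: 32.0 };
    let identity = Pose {
        r: [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
        t: [0.0, 0.0, 0.0],
    };
    // The principal-point pixel at depth 2.0 lifts to (0, 0, 2) under an identity pose.
    let p = unproject(k, identity, 32.0, 32.0, 2.0);
    println!("{:?}", p);
}
```

Every stored patch carries such a world-space anchor, which is what makes pose-based retrieval possible later.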
## Architecture
```
mosaicmem/
  src/
    attention/   # PRoPE, Warped RoPE, Warped Latent, memory cross-attention
    camera/      # Intrinsics, poses, trajectory I/O
    diffusion/   # Backbone, DDPM scheduler, VAE (synthetic stubs)
    geometry/    # Depth estimation, point cloud fusion, projection
    memory/      # Mosaic memory store, retrieval, spatial manipulation
    pipeline/    # Autoregressive generation, config, inference loop
  tests/
    integration.rs           # 120+ assertions across all modules
    meaningful_end_to_end.rs # Memory-conditioned revisit consistency test
```
| Module | Purpose | Key types |
|---|---|---|
| `attention` | Position encoding & cross-attention | `WarpedRoPE`, `WarpedLatent`, `PRoPE`, `MemoryCrossAttention` |
| `camera` | Camera model & trajectories | `CameraPose`, `CameraIntrinsics`, `CameraTrajectory` |
| `diffusion` | Generation backbone (synthetic) | `SyntheticBackbone`, `DDPMScheduler`, `SyntheticVAE` |
| `geometry` | 3D reconstruction primitives | `DepthEstimator`, `PointCloud`, `StreamingFusion` |
| `memory` | Spatial memory system | `MosaicMemoryStore`, `MemoryRetrieval`, `SpatialManipulation` |
| `pipeline` | End-to-end generation | `AutoregressivePipeline`, `PipelineConfig` |
## Quick start

### Prerequisites
- Rust 1.85+ (edition 2024)
- No GPU or external weights required
### Run

```sh
# Run the end-to-end synthetic demo (no GPU or model weights required)
cargo run --release -- demo
```
### Use as a library
```toml
# Cargo.toml
[dependencies]
mosaicmem = { git = "https://github.com/AbdelStark/mosaicmem.git" }
```
```rust
// Illustrative sketch: module paths and constructor signatures below follow
// the src/ layout and may differ from the crate's actual exports.
use mosaicmem::camera::CameraTrajectory;
use mosaicmem::geometry::SyntheticDepthEstimator; // synthetic depth backend
use mosaicmem::pipeline::{AutoregressivePipeline, PipelineConfig};

// Build a camera trajectory (here: a circular orbit)
let trajectory = CameraTrajectory::circle(/* radius, frame count, ... */);

// Configure and run the pipeline
let config = PipelineConfig::default();
let mut pipeline = AutoregressivePipeline::new(config);
let frames = pipeline.generate(&trajectory)?;
```
## CLI reference
```
mosaicmem <COMMAND>

Commands:
  generate     Generate video frames from a camera trajectory
  demo         Run with synthetic data (no models required)
  inspect      Show memory/geometry statistics for a trajectory
  visualize    Display memory store diagnostics
  splice       Merge two memory stores with spatial layout
  export-ply   Export reconstructed point cloud to PLY
  show-config  Dump or load pipeline configuration as JSON
  bench        Run pipeline performance benchmark
```
## Examples
```sh
# Generate frames from a trajectory file (arguments elided)
mosaicmem generate ...

# Inspect memory coverage across a trajectory
mosaicmem inspect ...

# Export the reconstructed 3D scene
mosaicmem export-ply ...

# Benchmark throughput
mosaicmem bench

# Splice two scenes side-by-side
mosaicmem splice ...
```
## Key features
- Streaming 3D fusion -- incrementally builds a point cloud as new keyframes arrive, no batch reconstruction needed
- kd-tree spatial index -- O(log n) nearest-neighbor retrieval over millions of stored patches via kiddo
- Warped RoPE -- applies geometry-aware rotary position encoding so attention respects 3D spatial relationships, not just token order
- Warped Latent -- feature-level alignment that reprojects retrieved patches into the target view's latent space
- PRoPE -- progressive rotary position encoding with temporal decay for long-horizon consistency
- Autoregressive windowing -- generates arbitrarily long videos with overlapping windows and memory carryover
- Adaptive keyframe selection -- automatically selects keyframes based on camera motion magnitude
- Memory manipulation -- splice, transform, and compose spatial memory stores for scene editing
- Parallel computation -- leverages rayon for multi-core depth estimation, projection, and fusion
- Deterministic synthetic backends -- full pipeline testable without GPU or model weights
- Zero unsafe code -- pure safe Rust throughout
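As an illustration of the adaptive keyframe idea listed above, selection can be reduced to thresholding camera motion since the last accepted keyframe. This standalone sketch uses translation distance only and hypothetical names; it is not the crate's actual selection logic, which also considers rotation:

```rust
// Keyframe gating by camera motion magnitude -- illustrative, not mosaicmem's API.

/// Euclidean distance between two camera positions.
fn translation_dist(a: [f64; 3], b: [f64; 3]) -> f64 {
    ((a[0] - b[0]).powi(2) + (a[1] - b[1]).powi(2) + (a[2] - b[2]).powi(2)).sqrt()
}

/// Accept a new keyframe once the camera has moved far enough.
fn is_keyframe(last_kf_pos: [f64; 3], pos: [f64; 3], min_dist: f64) -> bool {
    translation_dist(last_kf_pos, pos) >= min_dist
}

fn main() {
    let mut last = [0.0, 0.0, 0.0];
    let mut keyframes = vec![last];
    for step in 1..=10 {
        let pos = [step as f64 * 0.1, 0.0, 0.0]; // camera sliding along +x
        if is_keyframe(last, pos, 0.25) {
            keyframes.push(pos);
            last = pos;
        }
    }
    // Only a subset of the 10 frames become keyframes.
    println!("{} keyframes", keyframes.len());
}
```

Gating on motion magnitude keeps the memory store compact: a stationary camera adds nothing, while fast motion triggers dense keyframing.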
## Testing
The test suite covers:
- Camera pose composition and trajectory generation
- Depth estimation and 3D point cloud construction
- Memory store insertion, retrieval, and spatial queries
- Warped RoPE / Warped Latent geometric alignment correctness
- Full pipeline end-to-end generation with memory conditioning
- Revisit consistency: generated frames at revisited viewpoints are closer to the originally observed scene than unconditioned generation
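The revisit-consistency check boils down to comparing frame distances: a memory-conditioned regeneration should land closer to the originally observed frame than an unconditioned one. A minimal sketch of such a metric (illustrative, not the actual test code):

```rust
// Mean squared error between two grayscale frames -- illustrative metric sketch.
fn mse(a: &[f64], b: &[f64]) -> f64 {
    assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>() / a.len() as f64
}

fn main() {
    let observed = vec![0.5; 16];       // frame seen on the first visit
    let conditioned = vec![0.55; 16];   // memory-conditioned regeneration
    let unconditioned = vec![0.9; 16];  // generation without memory
    // Revisit consistency: conditioned output stays closer to the observed frame.
    assert!(mse(&observed, &conditioned) < mse(&observed, &unconditioned));
    println!("conditioned mse = {}", mse(&observed, &conditioned));
}
```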
## Performance
The synthetic pipeline (no neural network inference) on a single core:
| Resolution | Frames | Steps | Time |
|---|---|---|---|
| 64x64 | 32 | 5 | ~0.2s |
| 128x128 | 32 | 5 | ~0.8s |
| 256x256 | 16 | 50 | ~4s |
Memory store scales to millions of patches with sub-millisecond retrieval via kd-tree spatial indexing.
## Contributing
Contributions are welcome. Please open an issue first to discuss what you'd like to change.
## Citation