Video world models break when the camera moves too far, revisits old areas, or tries to maintain scene structure over long rollouts. MosaicMem fixes this with explicit spatial memory -- a geometry-aware memory stack that lifts observed patches into 3D, retrieves them at novel viewpoints, and injects them back into the generation loop.
This Rust implementation (mosaicmem) provides the complete memory-side pipeline: streaming 3D reconstruction, patch-level spatial storage, view-conditioned retrieval, geometric alignment via Warped RoPE / Warped Latent, and autoregressive generation plumbing. Ships with deterministic synthetic backends so the full pipeline runs end-to-end without external model weights.
## How it works
```
Keyframe ──> Depth ──> Lift to 3D ──> Memory Store
                                            |
Target Pose ──> Query Memory ──> Retrieve + Align ──> Condition Diffusion ──> Frame
                                  |           |
                            Warped RoPE  Warped Latent
```
- Depth estimation -- extract per-pixel depth from keyframes
- 3D lifting -- unproject patches into a shared world-space point cloud
- Spatial storage -- index patches in a kd-tree for fast nearest-neighbor lookup
- View-conditioned retrieval -- given a target camera pose, find the most relevant stored patches
- Geometric alignment -- apply Warped RoPE (attention-level) and Warped Latent (feature-level) to align retrieved context with the target view
- Conditioned generation -- inject aligned memory into a diffusion denoising loop to produce geometry-consistent frames
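The lifting step above is an ordinary pinhole unprojection followed by a rigid camera-to-world transform. A minimal, self-contained sketch (the types and function here are illustrative, not mosaicmem's actual API):

```rust
// Pinhole unprojection sketch -- illustrative types, not mosaicmem's API.

#[derive(Clone, Copy)]
struct Intrinsics { fx: f64, fy: f64, cx: f64, cy: f64 }

/// Camera-to-world pose as a 3x3 rotation matrix plus translation.
#[derive(Clone, Copy)]
struct Pose { r: [[f64; 3]; 3], t: [f64; 3] }

/// Lift pixel (u, v) at depth d into a world-space point.
fn unproject(k: Intrinsics, pose: Pose, u: f64, v: f64, d: f64) -> [f64; 3] {
    // Pixel -> camera-space ray, scaled by depth.
    let cam = [(u - k.cx) / k.fx * d, (v - k.cy) / k.fy * d, d];
    // Camera -> world: R * p + t.
    let mut world = pose.t;
    for i in 0..3 {
        for j in 0..3 {
            world[i] += pose.r[i][j] * cam[j];
        }
    }
    world
}

fn main() {
    let k = Intrinsics { fx: 100.0, fy: 100.0, cx: 32.0, cy: 32.0 };
    let identity = Pose {
        r: [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
        t: [0.0, 0.0, 0.0],
    };
    // The principal-point pixel at depth 2.0 lifts to (0, 0, 2) under an identity pose.
    let p = unproject(k, identity, 32.0, 32.0, 2.0);
    println!("{:?}", p);
}
```

Every stored patch carries such a world-space anchor, which is what makes pose-based retrieval possible later.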
## Architecture
```
mosaicmem/
  src/
    attention/   # PRoPE, Warped RoPE, Warped Latent, memory cross-attention
    camera/      # Intrinsics, poses, trajectory I/O
    diffusion/   # Backbone, DDPM scheduler, VAE (synthetic stubs)
    geometry/    # Depth estimation, point cloud fusion, projection
    memory/      # Mosaic memory store, retrieval, spatial manipulation
    pipeline/    # Autoregressive generation, config, inference loop
  tests/
    integration.rs           # 120+ assertions across all modules
    meaningful_end_to_end.rs # Memory-conditioned revisit consistency test
```
| Module | Purpose | Key types |
|---|---|---|
| `attention` | Position encoding & cross-attention | `WarpedRoPE`, `WarpedLatent`, `PRoPE`, `MemoryCrossAttention` |
| `camera` | Camera model & trajectories | `CameraPose`, `CameraIntrinsics`, `CameraTrajectory` |
| `diffusion` | Generation backbone (synthetic) | `SyntheticBackbone`, `DDPMScheduler`, `SyntheticVAE` |
| `geometry` | 3D reconstruction primitives | `DepthEstimator`, `PointCloud`, `StreamingFusion` |
| `memory` | Spatial memory system | `MosaicMemoryStore`, `MemoryRetrieval`, `SpatialManipulation` |
| `pipeline` | End-to-end generation | `AutoregressivePipeline`, `PipelineConfig` |
## Quick start

### Prerequisites
- Rust 1.85+ (edition 2024)
- No GPU or external weights required
### Run

```sh
# Run the end-to-end synthetic demo (no GPU or model weights required)
cargo run --release -- demo
```
### Use as a library
```toml
# Cargo.toml
[dependencies]
mosaicmem = { git = "https://github.com/AbdelStark/mosaicmem.git" }
```
```rust
// Illustrative sketch: module paths and constructor signatures below follow
// the src/ layout and may differ from the crate's actual exports.
use mosaicmem::camera::CameraTrajectory;
use mosaicmem::geometry::SyntheticDepthEstimator; // synthetic depth backend
use mosaicmem::pipeline::{AutoregressivePipeline, PipelineConfig};

// Build a camera trajectory (here: a circular orbit)
let trajectory = CameraTrajectory::circle(/* radius, frame count, ... */);

// Configure and run the pipeline
let config = PipelineConfig::default();
let mut pipeline = AutoregressivePipeline::new(config);
let frames = pipeline.generate(&trajectory)?;
```
## CLI reference
```
mosaicmem <COMMAND>

Commands:
  generate     Generate video frames from a camera trajectory
  demo         Run with synthetic data (no models required)
  inspect      Show memory/geometry statistics for a trajectory
  visualize    Display memory store diagnostics
  splice       Merge two memory stores with spatial layout
  export-ply   Export reconstructed point cloud to PLY
  show-config  Dump or load pipeline configuration as JSON
  bench        Run pipeline performance benchmark
```
## Examples
```sh
# Generate frames from a trajectory file (arguments elided)
mosaicmem generate ...

# Inspect memory coverage across a trajectory
mosaicmem inspect ...

# Export the reconstructed 3D scene
mosaicmem export-ply ...

# Benchmark throughput
mosaicmem bench

# Splice two scenes side-by-side
mosaicmem splice ...
```
## Key features
- Streaming 3D fusion -- incrementally builds a point cloud as new keyframes arrive, no batch reconstruction needed
- kd-tree spatial index -- O(log n) nearest-neighbor retrieval over millions of stored patches via kiddo
- Warped RoPE -- applies geometry-aware rotary position encoding so attention respects 3D spatial relationships, not just token order
- Warped Latent -- feature-level alignment that reprojects retrieved patches into the target view's latent space
- PRoPE -- progressive rotary position encoding with temporal decay for long-horizon consistency
- Autoregressive windowing -- generates arbitrarily long videos with overlapping windows and memory carryover
- Adaptive keyframe selection -- automatically selects keyframes based on camera motion magnitude
- Memory manipulation -- splice, transform, and compose spatial memory stores for scene editing
- Parallel computation -- leverages rayon for multi-core depth estimation, projection, and fusion
- Deterministic synthetic backends -- full pipeline testable without GPU or model weights
- Zero unsafe code -- pure safe Rust throughout
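As an illustration of the adaptive keyframe idea listed above, selection can be reduced to thresholding camera motion since the last accepted keyframe. This standalone sketch uses translation distance only and hypothetical names; it is not the crate's actual selection logic, which also considers rotation:

```rust
// Keyframe gating by camera motion magnitude -- illustrative, not mosaicmem's API.

/// Euclidean distance between two camera positions.
fn translation_dist(a: [f64; 3], b: [f64; 3]) -> f64 {
    ((a[0] - b[0]).powi(2) + (a[1] - b[1]).powi(2) + (a[2] - b[2]).powi(2)).sqrt()
}

/// Accept a new keyframe once the camera has moved far enough.
fn is_keyframe(last_kf_pos: [f64; 3], pos: [f64; 3], min_dist: f64) -> bool {
    translation_dist(last_kf_pos, pos) >= min_dist
}

fn main() {
    let mut last = [0.0, 0.0, 0.0];
    let mut keyframes = vec![last];
    for step in 1..=10 {
        let pos = [step as f64 * 0.1, 0.0, 0.0]; // camera sliding along +x
        if is_keyframe(last, pos, 0.25) {
            keyframes.push(pos);
            last = pos;
        }
    }
    // Only a subset of the 10 frames become keyframes.
    println!("{} keyframes", keyframes.len());
}
```

Gating on motion magnitude keeps the memory store compact: a stationary camera adds nothing, while fast motion triggers dense keyframing.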
## Testing
The test suite covers:
- Camera pose composition and trajectory generation
- Depth estimation and 3D point cloud construction
- Memory store insertion, retrieval, and spatial queries
- Warped RoPE / Warped Latent geometric alignment correctness
- Full pipeline end-to-end generation with memory conditioning
- Revisit consistency: generated frames at revisited viewpoints are closer to the originally observed scene than unconditioned generation
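The revisit-consistency check boils down to comparing frame distances: a memory-conditioned regeneration should land closer to the originally observed frame than an unconditioned one. A minimal sketch of such a metric (illustrative, not the actual test code):

```rust
// Mean squared error between two grayscale frames -- illustrative metric sketch.
fn mse(a: &[f64], b: &[f64]) -> f64 {
    assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>() / a.len() as f64
}

fn main() {
    let observed = vec![0.5; 16];       // frame seen on the first visit
    let conditioned = vec![0.55; 16];   // memory-conditioned regeneration
    let unconditioned = vec![0.9; 16];  // generation without memory
    // Revisit consistency: conditioned output stays closer to the observed frame.
    assert!(mse(&observed, &conditioned) < mse(&observed, &unconditioned));
    println!("conditioned mse = {}", mse(&observed, &conditioned));
}
```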
## Performance
The synthetic pipeline (no neural network inference) on a single core:
| Resolution | Frames | Steps | Time |
|---|---|---|---|
| 64x64 | 32 | 5 | ~0.2s |
| 128x128 | 32 | 5 | ~0.8s |
| 256x256 | 16 | 50 | ~4s |
Memory store scales to millions of patches with sub-millisecond retrieval via kd-tree spatial indexing.
## Contributing
Contributions are welcome. Please open an issue first to discuss what you'd like to change.
## Citation