nusy-arrow-git
Graph-native Git for Arrow RecordBatches — version, branch, merge, and diff knowledge graphs at sub-5ms latency.
Motivation
Traditional Git versions files. But when your data is a knowledge graph — millions of RDF-like triples stored as Arrow RecordBatches — serializing to files and back is a wasteful detour.
nusy-arrow-git operates directly on in-memory Arrow tables. Commits snapshot namespace-partitioned RecordBatches to Parquet. Checkouts restore them. Diffs compare triples, not lines. Merges detect semantic conflicts (same subject + predicate, different objects), not textual ones.
The result: version control that understands your data's structure, with commit+checkout cycles completing in under 5ms for 10,000 triples.
Features
Twelve git primitives, all operating on Arrow RecordBatches:
| Primitive | Description |
|---|---|
| commit | Snapshot graph state to per-namespace Parquet files |
| checkout | Restore graph state from a commit's snapshots |
| history | DAG traversal — log, ancestors, find_common_ancestor |
| refs | Mutable branches + immutable tags via RefsTable |
| diff | Object-level comparison — added/removed triples with full provenance |
| merge | 3-way merge with pluggable conflict resolution strategies |
| save | Crash-safe persistence via WAL + atomic Parquet writes |
| cherry_pick | Apply a single commit's changes onto a different HEAD |
| revert | Create an inverse commit that undoes a previous change |
| delete_branch | Remove a branch ref (cannot delete HEAD) |
| tags | Immutable ref pointers for release marking |
| object_store | GitObjectStore wrapper with configurable snapshot directory |
Quick Start
Add to your Cargo.toml:
[dependencies]
nusy-arrow-git = "0.14"
nusy-arrow-core = "0.14"
Create a store, add triples, commit, checkout, and diff:
use ;
use ;
Installation
Minimum Supported Rust Version (MSRV): the crate uses the Rust 2024 edition, which requires Rust 1.85 or later.
As a workspace dependency:
# Root Cargo.toml
[workspace.dependencies]
nusy-arrow-git = { path = "crates/nusy-arrow-git" }
Dependencies: Relies on nusy-arrow-core for the ArrowGraphStore, Triple,
Namespace, and YLayer types. Uses Apache Arrow 55+ and Parquet 55+ for
columnar storage.
Core Concepts
"There Are No Files"
Arrow tables ARE the data. Parquet snapshots are the undo log. There is no serialize-to-text step, no line-based diffing, no merge conflict markers in files.
When you commit(), each namespace (world, work, research, self) is written as
a separate Parquet file. When you checkout(), those Parquet files are loaded
back into Arrow RecordBatches. The graph store's in-memory state is always the
source of truth; Parquet is the persistence layer.
┌─────────────┐   commit()    ┌───────────────────────┐
│ ArrowGraph  │──────────────→│ snapshots/{id}/       │
│ Store       │               │   world.parquet       │
│ (in-memory) │←──────────────│   work.parquet        │
└─────────────┘  checkout()   │   research.parquet    │
                              └───────────────────────┘
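The commit and checkout flow above can be modeled in a few lines. This is a minimal in-memory sketch, not the crate's implementation: `ToyStore`, `Triple`, and the `HashMap`-backed snapshot table are hypothetical stand-ins for ArrowGraphStore, the RecordBatches, and the per-namespace Parquet files. It also shows what a triple-level diff between two commits looks like, as opposed to a line-based one.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in for one row of a triple RecordBatch.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct Triple {
    pub subject: String,
    pub predicate: String,
    pub object: String,
}

/// A triple tagged with the namespace it lives in.
pub type Tagged = (String, Triple);

/// Namespace name -> triples (stands in for snapshots/{id}/{namespace}.parquet).
type Snapshot = HashMap<String, Vec<Triple>>;

/// Toy model: live state plus one full snapshot per commit id.
#[derive(Default)]
pub struct ToyStore {
    live: Snapshot,
    snapshots: HashMap<u64, Snapshot>,
    next_id: u64,
}

impl ToyStore {
    pub fn add(&mut self, namespace: &str, s: &str, p: &str, o: &str) {
        self.live.entry(namespace.to_string()).or_default().push(Triple {
            subject: s.to_string(),
            predicate: p.to_string(),
            object: o.to_string(),
        });
    }

    /// Snapshot every namespace and return the new commit id.
    pub fn commit(&mut self) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.snapshots.insert(id, self.live.clone());
        id
    }

    /// Replace the live state with a commit's snapshot.
    /// Returns false if the commit id is unknown.
    pub fn checkout(&mut self, id: u64) -> bool {
        if let Some(snap) = self.snapshots.get(&id) {
            self.live = snap.clone();
            true
        } else {
            false
        }
    }

    /// Number of live triples in one namespace.
    pub fn count(&self, namespace: &str) -> usize {
        self.live.get(namespace).map_or(0, |v| v.len())
    }

    /// Triple-level diff between two commits, returned as (added, removed).
    pub fn diff(&self, from: u64, to: u64) -> Option<(Vec<Tagged>, Vec<Tagged>)> {
        let flatten = |id: u64| -> Option<HashSet<Tagged>> {
            let snap = self.snapshots.get(&id)?;
            Some(
                snap.iter()
                    .flat_map(|(ns, ts)| ts.iter().map(move |t| (ns.clone(), t.clone())))
                    .collect(),
            )
        };
        let (old, new) = (flatten(from)?, flatten(to)?);
        let added: Vec<Tagged> = new.difference(&old).cloned().collect();
        let removed: Vec<Tagged> = old.difference(&new).cloned().collect();
        Some((added, removed))
    }
}
```

Committing, adding more triples, committing again, and then calling `checkout` with the first id restores the earlier state; `diff` between the two ids reports only the triples that were added in between.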
WAL Crash Safety
The save() function uses a write-ahead log pattern:
- Write _wal.json listing namespaces to be saved
- Write each namespace to {name}.parquet.tmp
- Atomic rename() from .tmp to .parquet
- Remove _wal.json
If the process crashes mid-save, the WAL marker tells the next restore() that
the previous save was interrupted. Because writes use atomic rename, each Parquet
file is either fully old or fully new — never partial.
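The four steps above can be sketched with nothing but `std::fs`. This is an illustration under stated assumptions rather than the crate's `save()`: the function names `save_with_wal` and `save_was_interrupted` are hypothetical, raw bytes stand in for encoded Parquet data, and the sketch writes all temporary files before renaming any of them, which the step list does not specify. It also omits the fsync calls a durable implementation would likely need.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Name of the marker file, as given in the steps above.
const WAL_FILE: &str = "_wal.json";

/// Save each (namespace, bytes) pair using the four WAL steps.
/// `bytes` stands in for an encoded Parquet file.
pub fn save_with_wal(dir: &Path, namespaces: &[(&str, &[u8])]) -> io::Result<()> {
    fs::create_dir_all(dir)?;

    // Step 1: write the WAL marker listing the namespaces to be saved.
    let names: Vec<String> = namespaces
        .iter()
        .map(|(n, _)| format!("\"{n}\""))
        .collect();
    fs::write(dir.join(WAL_FILE), format!("[{}]", names.join(",")))?;

    // Step 2: write every namespace to a temporary file first.
    for (name, bytes) in namespaces {
        fs::write(dir.join(format!("{name}.parquet.tmp")), bytes)?;
    }

    // Step 3: rename each temporary file over its final name. On POSIX
    // systems rename replaces the destination in a single step when both
    // paths are on the same filesystem.
    for (name, _) in namespaces {
        fs::rename(
            dir.join(format!("{name}.parquet.tmp")),
            dir.join(format!("{name}.parquet")),
        )?;
    }

    // Step 4: remove the marker; its absence means the save completed.
    fs::remove_file(dir.join(WAL_FILE))
}

/// True if a previous save left its WAL marker behind.
pub fn save_was_interrupted(dir: &Path) -> bool {
    dir.join(WAL_FILE).exists()
}
```

A restore routine would call something like `save_was_interrupted` first and decide how to proceed; what the real `restore()` does in that case is not described here.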
3-Way Merge with Conflict Detection
Merges find the common ancestor (via BFS on the commit DAG), then diff both
branches against it. A conflict occurs when two branches modify the same
(subject, predicate, namespace) tuple with different object values.
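Both halves of that description can be sketched independently of Arrow. The code below is a simplified model, not the crate's merge code: `Parents`, `State`, `find_conflicts`, and this version of `find_common_ancestor` are hypothetical, commit ids are plain integers, and each (subject, predicate, namespace) key is assumed to hold a single object value. Cases where one side deletes a key are left out.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Hypothetical commit graph: commit id -> parent ids.
pub type Parents = HashMap<u32, Vec<u32>>;

/// All commits reachable from `start` (inclusive), via BFS over parents.
fn ancestors(graph: &Parents, start: u32) -> HashSet<u32> {
    let mut seen = HashSet::from([start]);
    let mut queue = VecDeque::from([start]);
    while let Some(id) = queue.pop_front() {
        for &p in graph.get(&id).into_iter().flatten() {
            if seen.insert(p) {
                queue.push_back(p);
            }
        }
    }
    seen
}

/// First commit found by BFS from `b` that is also an ancestor of `a`.
pub fn find_common_ancestor(graph: &Parents, a: u32, b: u32) -> Option<u32> {
    let of_a = ancestors(graph, a);
    let mut seen = HashSet::from([b]);
    let mut queue = VecDeque::from([b]);
    while let Some(id) = queue.pop_front() {
        if of_a.contains(&id) {
            return Some(id);
        }
        for &p in graph.get(&id).into_iter().flatten() {
            if seen.insert(p) {
                queue.push_back(p);
            }
        }
    }
    None
}

/// Graph state keyed by (subject, predicate, namespace) -> object.
pub type State = HashMap<(String, String, String), String>;

/// Keys that both branches changed relative to `base`, to different objects.
pub fn find_conflicts(
    base: &State,
    ours: &State,
    theirs: &State,
) -> Vec<(String, String, String)> {
    let mut conflicts = Vec::new();
    for (key, our_obj) in ours {
        // Only keys present on both sides are considered in this sketch.
        let Some(their_obj) = theirs.get(key) else { continue };
        let base_obj = base.get(key);
        let we_changed = base_obj != Some(our_obj);
        let they_changed = base_obj != Some(their_obj);
        if we_changed && they_changed && our_obj != their_obj {
            conflicts.push(key.clone());
        }
    }
    conflicts.sort(); // deterministic order for callers and tests
    conflicts
}
```

A key changed on only one side, or changed to the same object on both sides, is not reported; only divergent edits to the same key count as conflicts.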
Four built-in resolution strategies:
| Strategy | Behavior |
|---|---|
| Manual | Return conflicts for caller to resolve |
| Ours | Always keep branch A's value |
| Theirs | Always keep branch B's value |
| LastWriterWins | Compare consolidated_at timestamps |
Custom strategies are supported via closure:
use ;
let strategy = Custom;
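To make the strategy table concrete, here is a hedged sketch of how the four built-in behaviors and a closure-based custom strategy could be dispatched. The `Conflict` struct, the `Strategy` enum, and `resolve` are hypothetical shapes chosen for illustration; the crate's actual types and signatures may differ. The tie-breaking rule for LastWriterWins (ties keep branch A) is an assumption.

```rust
/// One conflicting value pair for a (subject, predicate, namespace) key.
/// Field names are illustrative; the timestamps mirror `consolidated_at`.
#[derive(Clone, Debug, PartialEq)]
pub struct Conflict {
    pub ours: String,
    pub theirs: String,
    pub ours_consolidated_at: u64,
    pub theirs_consolidated_at: u64,
}

/// Hypothetical enum mirroring the four built-in names plus a closure.
pub enum Strategy {
    Manual,
    Ours,
    Theirs,
    LastWriterWins,
    Custom(Box<dyn Fn(&Conflict) -> Option<String>>),
}

/// Returns the chosen object, or None when the caller must decide.
pub fn resolve(strategy: &Strategy, c: &Conflict) -> Option<String> {
    match strategy {
        // Manual: hand the conflict back unresolved.
        Strategy::Manual => None,
        // Ours: branch A's value.
        Strategy::Ours => Some(c.ours.clone()),
        // Theirs: branch B's value.
        Strategy::Theirs => Some(c.theirs.clone()),
        // Assumption: the later timestamp wins; ties keep branch A.
        Strategy::LastWriterWins => {
            if c.theirs_consolidated_at > c.ours_consolidated_at {
                Some(c.theirs.clone())
            } else {
                Some(c.ours.clone())
            }
        }
        // Custom: delegate to the caller-supplied closure.
        Strategy::Custom(f) => f(c),
    }
}
```

A custom strategy in this model is just a boxed closure, for example `Strategy::Custom(Box::new(|c: &Conflict| Some(c.ours.clone())))`.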
Performance
Benchmarked on DGX Spark with 10,000 triples:
| Operation | Target | Measured |
|---|---|---|
| Commit + Checkout | < 50ms | ~4.3ms |
| Save + Restore | < 100ms | ~4.3ms |
| Being awakening (M-119) | < 200ms | ~50ms |
| Batch add 10K triples | < 10ms | ~4.3ms |
| Single commit | < 25ms | ~3ms |
| Single checkout | < 25ms | ~1.3ms |
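All measured operations come in under their performance gates, by margins ranging from about 2x (batch add) to about 23x (save + restore). That leaves headroom for larger graphs, though only the 10,000-triple case is reported here.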
Safety and Invariants
- CommitsTable is append-only. History is never rewritten. Even revert() creates a new commit with inverse changes.
- WAL guarantees. A crash during save() cannot corrupt existing Parquet files. Atomic rename ensures each file is fully old or fully new.
- Tag immutability. Once created, a tag's commit pointer cannot be changed or deleted. Use delete_branch() for mutable refs.
- HEAD protection. The currently checked-out branch cannot be deleted.
- Merge commits record both parents. The commit DAG is a true DAG, not a linear chain.
- Schema versioning. Parquet files include a nusy_schema_version metadata key. On checkout, normalize_to_current() adapts old schemas forward.
- Thread safety. Individual operations are not internally synchronized. Callers must ensure exclusive access to GitObjectStore during mutations.
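The ref-related invariants in this list (mutable branches, immutable tags, HEAD protection) can be expressed as a small model. `ToyRefs` and `RefError` below are hypothetical names for illustration only and are not the crate's RefsTable API.

```rust
use std::collections::HashMap;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum RefError {
    TagExists,
    CannotDeleteHead,
    NotFound,
}

/// Toy refs table: branches may move, tags may not.
pub struct ToyRefs {
    branches: HashMap<String, u64>,
    tags: HashMap<String, u64>,
    head: String,
}

impl ToyRefs {
    pub fn new(head_branch: &str, commit: u64) -> Self {
        let mut branches = HashMap::new();
        branches.insert(head_branch.to_string(), commit);
        ToyRefs {
            branches,
            tags: HashMap::new(),
            head: head_branch.to_string(),
        }
    }

    /// Branches are mutable: creating or moving one always succeeds.
    pub fn set_branch(&mut self, name: &str, commit: u64) {
        self.branches.insert(name.to_string(), commit);
    }

    /// Tags are immutable: a second write to the same name is rejected.
    pub fn create_tag(&mut self, name: &str, commit: u64) -> Result<(), RefError> {
        if self.tags.contains_key(name) {
            return Err(RefError::TagExists);
        }
        self.tags.insert(name.to_string(), commit);
        Ok(())
    }

    /// HEAD protection: the checked-out branch cannot be deleted.
    pub fn delete_branch(&mut self, name: &str) -> Result<(), RefError> {
        if name == self.head {
            return Err(RefError::CannotDeleteHead);
        }
        self.branches
            .remove(name)
            .map(|_| ())
            .ok_or(RefError::NotFound)
    }

    /// Look up the commit a tag points at.
    pub fn tag(&self, name: &str) -> Option<u64> {
        self.tags.get(name).copied()
    }
}
```

Note that the model exposes no method for deleting or moving a tag at all, which is one simple way to enforce immutability through the type's surface rather than through runtime checks.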
Panics
- create_commit will return CommitError::Io if the snapshot directory cannot be created (permissions, disk full).
- checkout will return CommitError::NotFound if the commit's snapshot directory is missing (e.g., manually deleted).
- revert returns RevertError::MergeCommit for merge commits (multiple parents).
No function in this crate calls panic!() or unwrap() in library code.
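As an illustration of the errors-as-values style described above, here is a sketch of a revert that builds an inverse commit and rejects merge commits without panicking. `ToyCommit` and this `revert` signature are hypothetical, and the `RevertError` enum only mirrors the one variant named in this section; the real types may carry more information.

```rust
/// Hypothetical mirror of the error named above; the real enum may differ.
#[derive(Debug, PartialEq)]
pub enum RevertError {
    MergeCommit,
}

/// A triple as (subject, predicate, object).
pub type Triple = (String, String, String);

/// Toy commit: parent ids plus the triples it added and removed.
#[derive(Clone, Debug, PartialEq)]
pub struct ToyCommit {
    pub parents: Vec<u64>,
    pub added: Vec<Triple>,
    pub removed: Vec<Triple>,
}

/// Build the inverse of `target` as a new commit on top of `head`.
/// Failures are returned as values; nothing here panics.
pub fn revert(target: &ToyCommit, head: u64) -> Result<ToyCommit, RevertError> {
    // With more than one parent there is no single diff to invert.
    if target.parents.len() > 1 {
        return Err(RevertError::MergeCommit);
    }
    Ok(ToyCommit {
        parents: vec![head],
        // Undo by swapping roles: re-add what was removed,
        // and remove what was added.
        added: target.removed.clone(),
        removed: target.added.clone(),
    })
}
```

Because the result is a new commit whose parent is the current head, history stays append-only, matching the invariant stated for CommitsTable.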
See Also
- nusy-arrow-core: The Arrow graph store, triple schema, namespace partitioning, and Y-layer classification that nusy-arrow-git operates on.
- nusy-kanban: Arrow-native kanban CLI that uses save_named_batches() / restore_named_batches() for crash-safe persistence. A real-world consumer of this crate's save primitives.
- nusy-graph-review: Graph-native code review built on top of nusy-arrow-git's diff and merge primitives.
License
MIT — see LICENSE for details.