nusy-arrow-git 0.15.2

Graph-native git on Arrow RecordBatches — commit, checkout, merge, diff, blame, rebase, cherry-pick, revert
Documentation

nusy-arrow-git

Crates.io docs.rs CI License: MIT

Graph-native Git for Arrow RecordBatches — version, branch, merge, and diff knowledge graphs at sub-5ms latency.

Motivation

Traditional Git versions files. But when your data is a knowledge graph — millions of RDF-like triples stored as Arrow RecordBatches — serializing to files and back is a wasteful detour.

nusy-arrow-git operates directly on in-memory Arrow tables. Commits snapshot namespace-partitioned RecordBatches to Parquet. Checkouts restore them. Diffs compare triples, not lines. Merges detect semantic conflicts (same subject + predicate, different objects), not textual ones.

The result: version control that understands your data's structure, with commit+checkout cycles completing in under 5ms for 10,000 triples.

Features

Twelve git primitives, all operating on Arrow RecordBatches:

Primitive Description
commit Snapshot graph state to per-namespace Parquet files
checkout Restore graph state from a commit's snapshots
history DAG traversal — log, ancestors, find_common_ancestor
refs Mutable branches + immutable tags via RefsTable
diff Object-level comparison — added/removed triples with full provenance
merge 3-way merge with pluggable conflict resolution strategies
save Crash-safe persistence via WAL + atomic Parquet writes
cherry_pick Apply a single commit's changes onto a different HEAD
revert Create an inverse commit that undoes a previous change
delete_branch Remove a branch ref (cannot delete HEAD)
tags Immutable ref pointers for release marking
object_store GitObjectStore wrapper with configurable snapshot directory

Quick Start

Add to your Cargo.toml:

[dependencies]
nusy-arrow-git = "0.14"
nusy-arrow-core = "0.14"

Create a store, add triples, commit, checkout, and diff:

use nusy_arrow_core::{Namespace, Triple, YLayer};
use nusy_arrow_git::{
    CommitsTable, GitObjectStore, checkout, create_commit, diff,
};

fn triple(s: &str, p: &str, o: &str) -> Triple {
    Triple {
        subject: s.into(), predicate: p.into(), object: o.into(),
        graph: None, confidence: None, source_document: None,
        source_chunk_id: None, extracted_by: None, caused_by: None,
        derived_from: None, consolidated_at: None,
    }
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create a git-aware graph store
    let mut obj = GitObjectStore::with_snapshot_dir("my_snapshots");
    let mut commits = CommitsTable::new();

    // Add a triple and commit
    obj.store.add_triple(&triple("alice", "knows", "bob"), Namespace::World, YLayer::Semantic)?;
    let c1 = create_commit(&obj, &mut commits, vec![], "Add alice→bob", "dev")?;

    // Add another triple and commit
    obj.store.add_triple(&triple("bob", "knows", "carol"), Namespace::World, YLayer::Semantic)?;
    let c2 = create_commit(&obj, &mut commits, vec![c1.commit_id.clone()], "Add bob→carol", "dev")?;

    // Diff between commits
    let changes = diff(&mut obj, &commits, &c1.commit_id, &c2.commit_id)?;
    println!("Added: {}, Removed: {}", changes.added.len(), changes.removed.len());

    // Checkout the first commit (restores exact state)
    checkout(&mut obj, &commits, &c1.commit_id)?;
    Ok(())
}

Installation

Minimum Supported Rust Version (MSRV): Rust 2024 edition

As a workspace dependency:

# Root Cargo.toml
[workspace.dependencies]
nusy-arrow-git = { path = "crates/nusy-arrow-git" }

Dependencies: Relies on nusy-arrow-core for the ArrowGraphStore, Triple, Namespace, and YLayer types. Uses Apache Arrow 55+ and Parquet 55+ for columnar storage.

Core Concepts

"There Are No Files"

Arrow tables ARE the data. Parquet snapshots are the undo log. There is no serialize-to-text step, no line-based diffing, no merge conflict markers in files.

When you commit(), each namespace (world, work, research, self) is written as a separate Parquet file. When you checkout(), those Parquet files are loaded back into Arrow RecordBatches. The graph store's in-memory state is always the source of truth; Parquet is the persistence layer.

┌─────────────┐   commit()    ┌───────────────────────┐
│ ArrowGraph  │──────────────→│ snapshots/{id}/        │
│ Store       │               │   world.parquet        │
│ (in-memory) │←──────────────│   work.parquet         │
└─────────────┘   checkout()  │   research.parquet     │
                              └───────────────────────┘

WAL Crash Safety

The save() function uses a write-ahead log pattern:

  1. Write _wal.json listing namespaces to be saved
  2. Write each namespace to {name}.parquet.tmp
  3. Atomic rename() from .tmp to .parquet
  4. Remove _wal.json

If the process crashes mid-save, the WAL marker tells the next restore() that the previous save was interrupted. Because writes use atomic rename, each Parquet file is either fully old or fully new — never partial.

3-Way Merge with Conflict Detection

Merges find the common ancestor (via BFS on the commit DAG), then diff both branches against it. A conflict occurs when two branches modify the same (subject, predicate, namespace) tuple with different object values.

Four built-in resolution strategies:

Strategy Behavior
Manual Return conflicts for caller to resolve
Ours Always keep branch A's value
Theirs Always keep branch B's value
LastWriterWins Compare consolidated_at timestamps

Custom strategies are supported via closure:

use nusy_arrow_git::{MergeStrategy, Resolution};

let strategy = MergeStrategy::Custom(Box::new(|conflict| {
    if conflict.namespace == "world" {
        Resolution::KeepOurs
    } else {
        Resolution::KeepTheirs
    }
}));

Performance

Benchmarked on DGX Spark with 10,000 triples:

Operation Target Measured
Commit + Checkout < 50ms ~4.3ms
Save + Restore < 100ms ~4.3ms
Being awakening (M-119) < 200ms ~50ms
Batch add 10K triples < 10ms ~4.3ms
Single commit < 25ms ~3ms
Single checkout < 25ms ~1.3ms

All operations are well under their performance gates, leaving headroom for graphs 10-100x larger.

Safety and Invariants

  • CommitsTable is append-only. History is never rewritten. Even revert() creates a new commit with inverse changes.
  • WAL guarantees. A crash during save() cannot corrupt existing Parquet files. Atomic rename ensures each file is fully old or fully new.
  • Tag immutability. Once created, a tag's commit pointer cannot be changed or deleted. Use delete_branch() for mutable refs.
  • HEAD protection. The currently checked-out branch cannot be deleted.
  • Merge commits record both parents. The commit DAG is a true DAG, not a linear chain.
  • Schema versioning. Parquet files include a nusy_schema_version metadata key. On checkout, normalize_to_current() adapts old schemas forward.
  • Thread safety. Individual operations are not internally synchronized. Callers must ensure exclusive access to GitObjectStore during mutations.

Panics

  • create_commit will return CommitError::Io if the snapshot directory cannot be created (permissions, disk full).
  • checkout will return CommitError::NotFound if the commit's snapshot directory is missing (e.g., manually deleted).
  • revert returns RevertError::MergeCommit for merge commits (multiple parents).

No function in this crate calls panic!() or unwrap() in library code.

See Also

  • nusy-arrow-core — The Arrow graph store, triple schema, namespace partitioning, and Y-layer classification that nusy-arrow-git operates on.
  • nusy-kanban — Arrow-native kanban CLI that uses save_named_batches() / restore_named_batches() for crash-safe persistence. A real-world consumer of this crate's save primitives.
  • nusy-graph-review — Graph-native code review built on top of nusy-arrow-git's diff and merge primitives.

License

MIT — see LICENSE for details.