nusy-arrow-git

Graph-native Git for Arrow RecordBatches — version, branch, merge, and diff knowledge graphs at sub-5ms latency.

Motivation

Traditional Git versions files. But when your data is a knowledge graph — millions of RDF-like triples stored as Arrow RecordBatches — serializing to files and back is a wasteful detour.

nusy-arrow-git operates directly on in-memory Arrow tables. Commits snapshot namespace-partitioned RecordBatches to Parquet. Checkouts restore them. Diffs compare triples, not lines. Merges detect semantic conflicts (same subject + predicate, different objects), not textual ones.

The result: version control that understands your data's structure, with commit+checkout cycles completing in under 5ms for 10,000 triples.

Features

Twelve git primitives, all operating on Arrow RecordBatches:

Primitive	Description
commit	Snapshot graph state to per-namespace Parquet files
checkout	Restore graph state from a commit's snapshots
history	DAG traversal — `log`, `ancestors`, `find_common_ancestor`
refs	Mutable branches + immutable tags via `RefsTable`
diff	Object-level comparison — added/removed triples with full provenance
merge	3-way merge with pluggable conflict resolution strategies
save	Crash-safe persistence via WAL + atomic Parquet writes
cherry_pick	Apply a single commit's changes onto a different HEAD
revert	Create an inverse commit that undoes a previous change
delete_branch	Remove a branch ref (cannot delete HEAD)
tags	Immutable ref pointers for release marking
object_store	`GitObjectStore` wrapper with configurable snapshot directory

Quick Start

Add to your Cargo.toml:

[dependencies]
nusy-arrow-git = "0.14"
nusy-arrow-core = "0.14"

Create a store, add triples, commit, checkout, and diff:

use nusy_arrow_core::{Namespace, Triple, YLayer};
use nusy_arrow_git::{
    CommitsTable, GitObjectStore, checkout, create_commit, diff,
};

fn triple(s: &str, p: &str, o: &str) -> Triple {
    Triple {
        subject: s.into(), predicate: p.into(), object: o.into(),
        graph: None, confidence: None, source_document: None,
        source_chunk_id: None, extracted_by: None, caused_by: None,
        derived_from: None, consolidated_at: None,
    }
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create a git-aware graph store
    let mut obj = GitObjectStore::with_snapshot_dir("my_snapshots");
    let mut commits = CommitsTable::new();

    // Add a triple and commit
    obj.store.add_triple(&triple("alice", "knows", "bob"), Namespace::World, YLayer::Semantic)?;
    let c1 = create_commit(&obj, &mut commits, vec![], "Add alice→bob", "dev")?;

    // Add another triple and commit
    obj.store.add_triple(&triple("bob", "knows", "carol"), Namespace::World, YLayer::Semantic)?;
    let c2 = create_commit(&obj, &mut commits, vec![c1.commit_id.clone()], "Add bob→carol", "dev")?;

    // Diff between commits
    let changes = diff(&mut obj, &commits, &c1.commit_id, &c2.commit_id)?;
    println!("Added: {}, Removed: {}", changes.added.len(), changes.removed.len());

    // Checkout the first commit (restores exact state)
    checkout(&mut obj, &commits, &c1.commit_id)?;
    Ok(())
}

Installation

Minimum Supported Rust Version (MSRV): Rust 2024 edition

As a workspace dependency:

# Root Cargo.toml
[workspace.dependencies]
nusy-arrow-git = { path = "crates/nusy-arrow-git" }

Dependencies: Relies on nusy-arrow-core for the ArrowGraphStore, Triple, Namespace, and YLayer types. Uses Apache Arrow 55+ and Parquet 55+ for columnar storage.

Core Concepts

"There Are No Files"

Arrow tables ARE the data. Parquet snapshots are the undo log. There is no serialize-to-text step, no line-based diffing, no merge conflict markers in files.

When you commit(), each namespace (world, work, research, self) is written as a separate Parquet file. When you checkout(), those Parquet files are loaded back into Arrow RecordBatches. The graph store's in-memory state is always the source of truth; Parquet is the persistence layer.

┌─────────────┐   commit()    ┌───────────────────────┐
│ ArrowGraph  │──────────────→│ snapshots/{id}/        │
│ Store       │               │   world.parquet        │
│ (in-memory) │←──────────────│   work.parquet         │
└─────────────┘   checkout()  │   research.parquet     │
                              └───────────────────────┘

WAL Crash Safety

The save() function uses a write-ahead log pattern:

Write _wal.json listing namespaces to be saved
Write each namespace to {name}.parquet.tmp
Atomic rename() from .tmp to .parquet
Remove _wal.json

If the process crashes mid-save, the WAL marker tells the next restore() that the previous save was interrupted. Because writes use atomic rename, each Parquet file is either fully old or fully new — never partial.

3-Way Merge with Conflict Detection

Merges find the common ancestor (via BFS on the commit DAG), then diff both branches against it. A conflict occurs when two branches modify the same (subject, predicate, namespace) tuple with different object values.

Four built-in resolution strategies:

Strategy	Behavior
`Manual`	Return conflicts for caller to resolve
`Ours`	Always keep branch A's value
`Theirs`	Always keep branch B's value
`LastWriterWins`	Compare `consolidated_at` timestamps

Custom strategies are supported via closure:

use nusy_arrow_git::{MergeStrategy, Resolution};

let strategy = MergeStrategy::Custom(Box::new(|conflict| {
    if conflict.namespace == "world" {
        Resolution::KeepOurs
    } else {
        Resolution::KeepTheirs
    }
}));

Performance

Benchmarked on DGX Spark with 10,000 triples:

Operation	Target	Measured
Commit + Checkout	< 50ms	~4.3ms
Save + Restore	< 100ms	~4.3ms
Being awakening (M-119)	< 200ms	~50ms
Batch add 10K triples	< 10ms	~4.3ms
Single commit	< 25ms	~3ms
Single checkout	< 25ms	~1.3ms

All operations are well under their performance gates, leaving headroom for graphs 10-100x larger.

Safety and Invariants

CommitsTable is append-only. History is never rewritten. Even revert() creates a new commit with inverse changes.
WAL guarantees. A crash during save() cannot corrupt existing Parquet files. Atomic rename ensures each file is fully old or fully new.
Tag immutability. Once created, a tag's commit pointer cannot be changed or deleted. Use delete_branch() for mutable refs.
HEAD protection. The currently checked-out branch cannot be deleted.
Merge commits record both parents. The commit DAG is a true DAG, not a linear chain.
Schema versioning. Parquet files include a nusy_schema_version metadata key. On checkout, normalize_to_current() adapts old schemas forward.
Thread safety. Individual operations are not internally synchronized. Callers must ensure exclusive access to GitObjectStore during mutations.

Panics

create_commit will return CommitError::Io if the snapshot directory cannot be created (permissions, disk full).
checkout will return CommitError::NotFound if the commit's snapshot directory is missing (e.g., manually deleted).
revert returns RevertError::MergeCommit for merge commits (multiple parents).

No function in this crate calls panic!() or unwrap() in library code.

License

MIT — see LICENSE for details.

nusy-arrow-git 0.15.2