generative-artifact-protocol 0.16.0

Generative Artifact Protocol (GAP) — token-efficient artifact generation and updates for LLMs
Documentation

Warning: This project is v0 — the protocol, schemas, and APIs are subject to breaking changes without notice until a formal release.

An open standard protocol — GAP — that lets LLMs declare, diff, and reprovision text artifacts with minimal token expenditure. Includes a Rust reference implementation of the apply engine plus an evaluation CLI for measuring token efficiency against real LLM runs.

Features

  • Envelope system — three operation types (synthesize, edit, handle) for full generation, targeted updates, and lightweight references
  • Stateless apply engine — pure function, no I/O, ~2μs per edit; portable to browsers (WASM), IDEs, CLIs, or service backends
  • ID-based targeting<gap:target id="ID"> markers and JSON Pointer paths eliminate hallucinated search strings
  • Format-agnostic — works with HTML, Python, JavaScript, JSON, YAML, Rust, Go, SVG, and more
  • Up to 99% output token reduction per edit — actual reduction depends on edit scope: ~95-99% for value changes, ~80-95% for section rewrites; total cost savings depend on model pricing ratio and edit history (cost model)
  • SSE transport binding — wire format for streaming with reconnection support (GAP-SSE)
  • Evaluation framework — 90 experiment datasets measuring token efficiency and reliability against real LLM runs

Install

Rust crate:

cargo add generative-artifact-protocol

From source (full workspace):

git clone https://github.com/urmzd/generative-artifact-protocol
cd generative-artifact-protocol

Requires Rust (stable) and optionally just (for recipes).

Quick Start

# Build the Rust library
just build

# Run tests
just test

# Run criterion benchmarks (apply engine speed)
just bench

# Build the eval CLI (release)
just build-eval

# Run LLM evaluations
just run count=5 model="gemini-2.0-flash"

# Generate report from experiment metrics
just report

Usage

How it works

LLM ──produces──▶ envelope ──apply──▶ (artifact, handle)
                                 ▲
                           gap (stateless, ~2μs)
  1. An LLM produces an artifact envelope (JSON) — either a synthesize envelope (full content with target markers) or an edit envelope (targeted changes by ID or JSON Pointer).
  2. The apply engine resolves the envelope against the current artifact state to produce the updated artifact and a lightweight handle.
  3. The orchestrator holds handles; the resolved artifact is stored and consumed by downstream tools — browsers, IDEs, etc.

Apply engine

The core of the library is a single stateless function:

pub fn apply(artifact: Option<&Artifact>, envelope: &Envelope) -> Result<(Artifact, Envelope)>
Envelope Direction Description
synthesize input Complete artifact content (baseline or reset) with <gap:target> markers
edit input Targeted changes via ID (<gap:target> markers) or JSON Pointer
handle output Lightweight reference returned after every synthesize or edit

Recipes

Recipe Description
just build Compile the Rust library
just test Run Rust unit tests
just bench Criterion micro-benchmarks (apply engine speed)
just build-eval Build the eval CLI (release)
just run [count] [model] [id] [flow] [api-base] [api-key] Run conversation benchmark experiments (base vs GAP flows)
just report Generate markdown report from experiment metrics
just score Retroactive quality scoring of experiment results

Running evals on a free tier

The eval CLI speaks the OpenAI chat completions wire format, so any OpenAI-compatible endpoint works. Point it at a free-tier provider with GAP_API_BASE and GAP_API_KEY (or --api-base / --api-key). The CLI also picks up OPENAI_API_KEY as a fallback.

Provider Base URL Free-tier highlights
Google AI Studio (Gemini) https://generativelanguage.googleapis.com/v1beta/openai/ Persistent free tier on gemini-2.5-flash / gemini-2.0-flash — recommended starting point
Groq https://api.groq.com/openai/v1 Free llama-3.3-70b-versatile, qwen3-32b, etc.; sub-second TTFT, ~30 RPM
Cerebras https://api.cerebras.ai/v1 Free llama-4-scout, qwen-3-32b; fastest tokens/sec on the list
OpenRouter https://openrouter.ai/api/v1 25+ models suffixed :free (e.g. deepseek/deepseek-chat:free); aggregates many providers
Mistral La Plateforme https://api.mistral.ai/v1 Free "Experiment" tier — 1 B tokens/month at ~2 RPM
GitHub Models https://models.inference.ai.azure.com Free preview for developers; OpenAI, Llama, Mistral, Phi
Cloudflare Workers AI https://api.cloudflare.com/client/v4/accounts/<ACCOUNT_ID>/ai/v1 Free daily quota across Llama, Mistral, Qwen

Example — run five experiments against Gemini for free:

export GAP_API_BASE="https://generativelanguage.googleapis.com/v1beta/openai/"
export GAP_API_KEY="$GEMINI_API_KEY"
just run count=5 model="gemini-2.5-flash"

Example — run a single experiment against an OpenRouter free model:

just run count=1 model="deepseek/deepseek-chat:free" \
  api-base="https://openrouter.ai/api/v1" api-key="$OPENROUTER_API_KEY"

Tip — model asymmetry for cheaper runs. GAP's maintain context only needs recall and structured output. A frontier model for init paired with a free or near-free model for maintain matches the model asymmetry the cost model assumes. Today the CLI runs both flows with one model — for asymmetric runs, invoke gap-eval run --flow gap against the cheaper provider after a baseline pass with the stronger one.

Caveat. Free tiers carry per-minute and per-day caps. For the full 90-experiment suite, expect to throttle or batch with --count and --id.

Cost model

GAP saves tokens by replacing full artifact regeneration with small edit envelopes. The actual savings are not deterministic — they depend on edit scope, how efficiently the LLM generates envelopes, the model's tokenizer, and pricing. The LLM may produce larger-than-minimal edits or fall back to full regeneration. The estimates below assume well-formed, targeted edits. See the full derivation in the spec.

In a naive conversation, each turn's input carries everything: instructions, all prior artifact versions, all prior messages, and the current request — growing quadratically with edit count. GAP's maintain context is stateless: each call starts fresh with only the instructions ($I$), the current artifact ($S$), and the edit request. It produces a small edit envelope ($d$ output tokens) and terminates. The apply engine resolves the edit deterministically (CPU, ~2μs, zero tokens). Three effects compound:

  • Input token reduction: a naive conversation at edit $k$ reads $I + S_0 + S_1 + \ldots + S_{k-1}$ — every prior version. GAP reads only $I + S_{k-1}$ — the current artifact. At 10 edits, this is ~78% fewer input tokens.
  • Output token reduction: the LLM produces $d$ tokens instead of $S$ per edit — only the changed content plus a constant envelope overhead (JSON structure, target IDs). $d$ scales with the size of the change, not the size of the artifact. A two-value edit on a 2,000-token artifact produces ~30 tokens; a section rewrite produces hundreds. Either way, unchanged content is never regenerated.
  • Model asymmetry: the maintain context can use a cheaper model than the orchestrator, multiplying savings further.

Projected example — using a reference model at $3/M input, $15/M output ($r = 5$), with a 2,000-token artifact, 30-token edit envelope, and 500 instruction tokens:

After $N$ edits Naive conversation GAP Estimated savings
1 $0.069 $0.039 ~43%
5 $0.279 $0.071 ~75%
10 $0.677 $0.111 ~84%

These figures assume ideal edit behavior ($d = 30$ tokens per edit). Actual savings vary — larger edits, section-level rewrites, or model-specific tokenization will shift these numbers. Savings also scale with $r$: at $r = 1$ (equal pricing), per-edit savings drop to ~44%; at $r = 5$, they reach ~79%.

Payload benchmarks

Payload size and apply time for each envelope type, measured against an 8 KB HTML dashboard fixture.

Note: "Payload savings" measures byte reduction — a proxy for output token reduction but not identical (tokenizers vary). See cost model for the full derivation.

Envelope Scenario Payload % of Full Payload savings Apply Time
synthesize Full generation (baseline) 8,164 B 100.0% 1 ns
edit 1 value replace (ID targeting) 12 B 0.1% 99.9% 1.5 µs
edit 4 value replaces (ID targeting) 50 B 0.6% 99.4% 3.5 µs
edit 1 section replace (ID targeting) 441 B 5.4% 94.6% 1.4 µs
edit 2 section replaces (ID targeting) 516 B 6.3% 93.7% 3.8 µs

Limitations

  1. Conflict resolution — GAP uses optimistic concurrency: the apply engine rejects edits whose version doesn't match stored_version + 1. Concurrent edits are rejected, not merged. There is no CRDT, OT, or automatic merge strategy — coordination is left to the orchestrator.

  2. Envelope generation — The LLM must produce well-formed envelopes with correct target IDs, valid JSON structure, and appropriate operation types. In practice, generation can fail: malformed envelopes, hallucinated target IDs, or the model falling back to full regeneration when a targeted edit would suffice. The spec mitigates recall errors by passing a closed targets list in handles (Section 7.2), but reliable envelope production remains an active area of work.

  3. Granularity tradeoff — Target placement affects generation reliability. Too coarse and edits replace large sections, losing efficiency. Too fine and the target set grows, increasing the chance of targeting errors and prompt complexity. When the system can't produce a valid envelope for a given granularity, it must fall back — and those fallback strategies need to be thought through.

  4. Adoption — GAP requires tooling on both sides: producers must emit valid envelopes, consumers must implement the apply engine. The protocol is open and the spec is public, but the ecosystem is early.

License

This project is dual-licensed:

See NOTICE for details. Attribution is required under both licenses.