mollendorff-ref 1.3.0

Möllendorff Ref

Renders web pages and PDFs into token-optimized JSON for LLM agents.

The problem

LLM agents are bad at getting fresh web content:

  • Stale training data — without explicit tooling, models default to their training corpus, returning confident answers from months-old snapshots. Even with web search enabled, models can weight stale training knowledge over fresh results.
  • Bot protection — curl, wget, and built-in web tools get blocked (403/999) by most sites. When they do get through, they return raw HTML — 10,000-50,000 tokens of navigation, ads, and markup noise.
  • SPA blindness — modern sites return empty HTML shells. The actual content loads via JavaScript after the initial response.

What ref does

Chrome renders the page, waits for all network requests to settle (SPAs included), then ref extracts structured content and outputs compact JSON.

100 KB of rendered HTML becomes 1-5 KB of structured JSON.

┌──────────┐    ┌─────────────┐    ┌──────────┐    ┌──────────┐
│ Headless │ →  │ networkIdle │ →  │ Strip    │ →  │ Compact  │
│ Chrome   │    │ wait (SPA)  │    │ nav/ads  │    │ JSON out │
└──────────┘    └─────────────┘    └──────────┘    └──────────┘

Output example

ref fetch https://example.com 2>/dev/null | jq .
{
  "url": "https://example.com",
  "status": "ok",
  "title": "Example Domain",
  "sections": [
    {
      "level": 1,
      "heading": "Example Domain",
      "content": "This domain is for use in illustrative examples..."
    }
  ],
  "links": [
    { "text": "More information...", "url": "https://www.iana.org/domains/example" }
  ],
  "chars": 1256
}

Null fields and empty arrays are omitted. Sections are capped (200 char headings, 2,000 char content). Code blocks include detected language. Status detects paywalls, login walls, and dead links.
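Those output rules can be pictured with a minimal Python sketch. The field names mirror the example above; the truncation and omission logic here is an illustration of the documented behavior, not ref's actual implementation:

```python
def compact_section(heading, content):
    """Apply the documented caps (200-char headings, 2,000-char content)
    and drop empty fields, mirroring how ref omits null fields and
    empty arrays from its JSON output. Illustrative only."""
    section = {"heading": heading[:200], "content": content[:2000]}
    # Keep only fields with actual content.
    return {k: v for k, v in section.items() if v}
```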

PDF extraction

ref pdf document.pdf 2>/dev/null | jq .

Same JSON structure, plus table detection (whitespace column analysis, header inference, markdown output) and heading detection (numbered sections, Roman numerals, ALL CAPS, academic/legal formats).
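The heading heuristics can be sketched roughly in Python. The patterns are named after the list above (numbered sections, Roman numerals, ALL CAPS); the real detection logic inside ref may differ:

```python
import re

def looks_like_heading(line):
    """Rough heuristics in the spirit of ref's heading detection:
    numbered sections, Roman-numeral sections, short ALL-CAPS lines.
    Not ref's actual implementation."""
    line = line.strip()
    if not line or len(line) > 200:          # headings are capped at 200 chars
        return False
    if re.match(r"^\d+(\.\d+)*\.?\s+\S", line):   # "2.1 Methods"
        return True
    if re.match(r"^[IVXLCDM]+\.\s+\S", line):     # "IV. Results"
        return True
    if line.isupper() and len(line.split()) <= 10:  # "INTRODUCTION"
        return True
    return False
```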

Commands

Command                        Description
ref fetch <url>                Render page via Chrome, output structured JSON
ref pdf <file>                 Extract text and tables from PDFs
ref scan <files>               Find URLs in markdown, build references.yaml
ref verify-refs <file>         Check reference entries, update status
ref check-links <file>         Validate URL health (HTTP status codes)
ref refresh-data --url <url>   Extract live data (market sizes, stats)
ref init                       Create references.yaml template
ref update                     Self-update from GitHub releases

AI orchestration

ref is designed to work with Asimov, a vendor-neutral orchestrator for AI coding CLIs (Claude Code, Gemini CLI, Codex CLI).

Asimov's freshness protocol forces agents to use ref fetch for all web content instead of relying on built-in search tools or training data:

{
  "rule": "MUST use ref fetch <url> for all web fetching. NEVER use WebSearch or WebFetch."
}

This ensures agents work with current, verified content rather than stale or hallucinated sources.
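In practice the rule amounts to routing every fetch through the CLI. A hypothetical agent-side wrapper might look like the following — the `fetch_web` function and its injectable `runner` parameter are illustrative, not part of ref or Asimov:

```python
import json
import subprocess

def fetch_web(url, runner=subprocess.run):
    """Route all web fetching through `ref fetch`, never a built-in
    search tool. `runner` is injectable so the wrapper can be tested
    without the ref binary installed (illustrative design, not Asimov's)."""
    proc = runner(["ref", "fetch", url],
                  capture_output=True, text=True, check=True)
    return json.loads(proc.stdout)
```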

Install

From releases:

# macOS (Apple Silicon)
curl -L https://github.com/mollendorff-ai/ref/releases/latest/download/ref-aarch64-apple-darwin.tar.gz | tar xz
sudo mv ref /usr/local/bin/

# macOS (Intel)
curl -L https://github.com/mollendorff-ai/ref/releases/latest/download/ref-x86_64-apple-darwin.tar.gz | tar xz
sudo mv ref /usr/local/bin/

# Linux (x64)
curl -L https://github.com/mollendorff-ai/ref/releases/latest/download/ref-x86_64-unknown-linux-musl.tar.gz | tar xz
sudo mv ref /usr/local/bin/

# Linux (ARM64)
curl -L https://github.com/mollendorff-ai/ref/releases/latest/download/ref-aarch64-unknown-linux-musl.tar.gz | tar xz
sudo mv ref /usr/local/bin/

From crates.io:

cargo install mollendorff-ref

Requirements

  • Chrome or Chromium (for fetch, check-links, verify-refs)
  • Rust toolchain (only when building from source)

License

MIT