mollendorff-ref 1.6.0

Möllendorff Ref

Renders web pages and PDFs into token-optimized JSON for LLM agents.

The problem

LLM agents are bad at getting fresh web content:

  • Stale training data: without explicit tooling, models default to their training corpus, returning confident answers from months-old snapshots. Even with web search enabled, models can weight stale training knowledge over fresh results.
  • Bot protection: curl, wget, and built-in web tools get blocked (403/999) by most sites. When they do get through, they return raw HTML, which is 10,000-50,000 tokens of navigation, ads, and markup noise.
  • SPA blindness: modern sites return empty HTML shells. The actual content loads via JavaScript after the initial response.

What ref does

Chrome renders the page, waits for all network requests to settle (SPAs included), then ref extracts structured content and outputs compact JSON.

100 KB of rendered HTML becomes 1-5 KB of structured JSON.

┌──────────┐    ┌────────────┐    ┌──────────┐    ┌──────────┐
│ Headless │ →  │ networkIdle│ →  │ Strip    │ →  │ Compact  │
│ Chrome   │    │ wait (SPA) │    │ nav/ads  │    │ JSON out │
└──────────┘    └────────────┘    └──────────┘    └──────────┘
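
To make the "strip nav/ads" stage concrete, here is a minimal Python sketch of the idea using only the standard library. It is not ref's implementation: the list of noise tags, the class name SectionExtractor, and the helper extract_sections are assumptions made for illustration, and it works on HTML that has already been rendered.

```python
from html.parser import HTMLParser

# Tags whose whole subtree counts as noise in this sketch (an assumption;
# the text above only says that navigation and ads are stripped).
NOISE_TAGS = {"nav", "header", "footer", "aside", "script", "style"}
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class SectionExtractor(HTMLParser):
    """Collects heading/content sections while skipping noise subtrees."""

    def __init__(self):
        super().__init__()
        self.sections = []
        self._noise_depth = 0      # > 0 while inside a noise subtree
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._noise_depth += 1
        elif tag in HEADING_TAGS and self._noise_depth == 0:
            self._in_heading = True
            self.sections.append(
                {"level": int(tag[1]), "heading": "", "content": ""}
            )

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._noise_depth > 0:
            self._noise_depth -= 1
        elif tag in HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        # Text inside noise, or before the first heading, is dropped here.
        if not text or self._noise_depth > 0 or not self.sections:
            return
        key = "heading" if self._in_heading else "content"
        current = self.sections[-1]
        current[key] = (current[key] + " " + text).strip()


def extract_sections(html):
    """Returns a list of {"level", "heading", "content"} dicts."""
    parser = SectionExtractor()
    parser.feed(html)
    parser.close()
    return parser.sections
```

Calling extract_sections on a rendered page yields a list shaped like the "sections" array in the output example, which is where most of the size reduction comes from.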

Output example

ref fetch https://example.com 2>/dev/null | jq .
{
  "url": "https://example.com",
  "status": "ok",
  "title": "Example Domain",
  "sections": [
    {
      "level": 1,
      "heading": "Example Domain",
      "content": "This domain is for use in illustrative examples..."
    }
  ],
  "links": [
    { "text": "More information...", "url": "https://www.iana.org/domains/example" }
  ],
  "chars": 1256
}
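
A consumer can turn this JSON into prompt-ready text with a few lines of Python. The sketch below is one possible approach, not part of ref: the function name to_prompt is hypothetical, and treating every status other than "ok" as a failure is an assumption, since only "ok" appears in the example. Missing fields are read with defaults because null fields and empty arrays are omitted from the output.

```python
import json

# Sample copied from the output example above.
SAMPLE = """
{
  "url": "https://example.com",
  "status": "ok",
  "title": "Example Domain",
  "sections": [
    {
      "level": 1,
      "heading": "Example Domain",
      "content": "This domain is for use in illustrative examples..."
    }
  ],
  "links": [
    { "text": "More information...", "url": "https://www.iana.org/domains/example" }
  ],
  "chars": 1256
}
"""


def to_prompt(doc):
    """Flattens a ref JSON document into plain text for an LLM prompt."""
    # Assumption: anything other than "ok" is treated as unusable content.
    if doc.get("status") != "ok":
        raise ValueError(f"fetch not ok: {doc.get('status')!r}")
    lines = [f"# {doc.get('title', '')}", f"Source: {doc['url']}", ""]
    for section in doc.get("sections", []):
        marker = "#" * section.get("level", 1)
        lines.append(f"{marker} {section.get('heading', '')}")
        lines.append(section.get("content", ""))
        lines.append("")
    for link in doc.get("links", []):
        lines.append(f"- {link['text']}: {link['url']}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_prompt(json.loads(SAMPLE)))
```

In practice the JSON would come from the standard output of ref fetch rather than from an inline string.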

Extract a specific element with --selector:

ref fetch --selector "#pricing-table" https://example.com 2>/dev/null | jq .

Null fields and empty arrays are omitted. Sections are capped at 200 characters for headings and 2,000 characters for content. Code blocks include the detected language. The status field flags paywalls, login walls, and dead links.
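
The omission and capping rules are easy to picture as a small post-processing pass. The Python sketch below shows one way to do it. The function names (compact, cap_section, shrink) are hypothetical, and plain truncation is an assumption: the text does not say whether ref cuts silently or marks truncated content.

```python
MAX_HEADING_CHARS = 200     # caps stated in the text above
MAX_CONTENT_CHARS = 2000


def compact(value):
    """Recursively drops None values and empty lists."""
    if isinstance(value, dict):
        out = {}
        for key, item in value.items():
            item = compact(item)
            if item is None or item == []:
                continue  # omit null fields and empty arrays
            out[key] = item
        return out
    if isinstance(value, list):
        return [compact(item) for item in value if item is not None]
    return value


def cap_section(section):
    """Truncates heading and content to the stated limits."""
    capped = dict(section)
    for key, limit in (("heading", MAX_HEADING_CHARS),
                       ("content", MAX_CONTENT_CHARS)):
        if key in capped:
            capped[key] = capped[key][:limit]
    return capped


def shrink(doc):
    """Applies both rules to a document dict and returns a new dict."""
    compacted = compact(doc)
    if "sections" in compacted:
        compacted["sections"] = [cap_section(s) for s in compacted["sections"]]
    return compacted
```

Because omitted fields never appear in the output, consumers should read optional keys with a default rather than index them directly.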

PDF extraction

ref pdf document.pdf 2>/dev/null | jq .

Same JSON structure, plus table detection (whitespace column analysis, header inference, markdown output) and heading detection (numbered sections, Roman numerals, ALL CAPS, academic/legal formats).
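
As a rough illustration of these two heuristics, the Python sketch below finds table columns by looking for character positions that are blank on every line, emits a markdown table, and classifies heading-like lines with simple patterns. It is a simplified stand-in, not ref's algorithm: all function names are hypothetical, header inference is reduced to "the first row is the header", the input is assumed to be a non-empty list of lines, and the heading patterns will produce false positives on real documents.

```python
import re


def column_spans(lines, min_gap=2):
    """Returns (start, end) spans of columns separated by blank gutters."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    # A position is blank when every line has a space there.
    blank = [all(line[i] == " " for line in padded) for i in range(width)]
    spans, start, i = [], None, 0
    while i < width:
        if not blank[i]:
            if start is None:
                start = i
            i += 1
            continue
        j = i
        while j < width and blank[j]:
            j += 1  # measure the blank run starting at i
        # A run of min_gap or more blanks (or one reaching the end) closes a column.
        if start is not None and (j - i >= min_gap or j == width):
            spans.append((start, i))
            start = None
        i = j
    if start is not None:
        spans.append((start, width))
    return spans


def to_markdown(lines):
    """Renders whitespace-aligned lines as a markdown table."""
    spans = column_spans(lines)
    rows = [[line[a:b].strip() for a, b in spans] for line in lines]
    header, body = rows[0], rows[1:]
    out = ["| " + " | ".join(header) + " |",
           "| " + " | ".join("---" for _ in header) + " |"]
    out += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(out)


NUMBERED = re.compile(r"^\d+(\.\d+)*\.?\s+\S")   # "1.2 Methods"
ROMAN = re.compile(r"^[IVXLCDM]+\.\s+\S")        # "IV. Results"


def looks_like_heading(line):
    """Heuristic check for numbered, Roman-numeral, or ALL CAPS headings."""
    text = line.strip()
    if not text or len(text) > 200:
        return False
    if NUMBERED.match(text) or ROMAN.match(text):
        return True
    letters = [c for c in text if c.isalpha()]
    return len(letters) >= 3 and all(c.isupper() for c in letters)
```

A single space inside a cell (for example a two-word value) does not split the column, because only gutters at least min_gap wide count as separators.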

Also accepts URLs — downloads and extracts in one step:

ref pdf https://example.com/report.pdf 2>/dev/null | jq .

Commands

Command                       Description
ref fetch <url>               Render page via Chrome, output structured JSON
ref pdf <file|url>            Extract text and tables from PDFs
ref scan <files>              Find URLs in markdown, build references.yaml
ref verify-refs <file>        Check reference entries, update status
ref check-links <file>        Validate URL health (HTTP status codes)
ref refresh-data --url <url>  Extract live data (market sizes, stats)
ref init                      Create references.yaml template
ref update                    Self-update from GitHub releases
ref mcp                       Start MCP server (JSON-RPC 2.0 over stdio)
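
To give a feel for what a scan over markdown involves, here is a minimal Python sketch that collects the unique URLs in a document. It is an illustration only: the regular expression and the function name find_urls are assumptions, and it does not reproduce ref scan or the references.yaml format, which is not described here.

```python
import re

# Matches http(s) URLs and stops at whitespace or common markdown delimiters.
URL_RE = re.compile(r"https?://[^\s<>()\[\]\"'`]+")


def find_urls(markdown):
    """Returns unique URLs in first-seen order, trailing punctuation trimmed."""
    seen = {}
    for match in URL_RE.finditer(markdown):
        url = match.group(0).rstrip(".,;:!?")
        seen.setdefault(url, None)  # dicts preserve insertion order
    return list(seen)
```

One known limitation of this pattern is that URLs containing parentheses are cut at the first parenthesis.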

MCP Server Mode

ref mcp starts a persistent MCP server over stdio. AI applications call tools directly, with no shell spawning, and the browser pool stays warm between calls.

Six tools: ref_fetch, ref_pdf, ref_check_links, ref_scan, ref_verify_refs, ref_refresh_data.
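
For readers who want to see what travels over stdio, the Python sketch below builds an MCP tools/call request as a single newline-terminated JSON-RPC 2.0 line. The helper names are hypothetical, and the "url" argument name is an assumption: the actual input schema of each tool is whatever the server reports in response to a tools/list request. A real client also has to complete the MCP initialize handshake before calling tools.

```python
import json


def jsonrpc_request(request_id, method, params):
    """Serializes one JSON-RPC 2.0 request as a newline-terminated line."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": method,
        "params": params,
    }
    # json.dumps escapes embedded newlines, so the message stays on one line.
    return json.dumps(message, separators=(",", ":")) + "\n"


def call_tool(request_id, tool, arguments):
    """Builds an MCP tools/call request for a named tool."""
    return jsonrpc_request(
        request_id, "tools/call", {"name": tool, "arguments": arguments}
    )


# Assumption: ref_fetch takes a "url" argument; check tools/list to confirm.
line = call_tool(2, "ref_fetch", {"url": "https://example.com"})
```

MCP client libraries and the applications configured below handle this framing automatically, so hand-written requests are mainly useful for debugging.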

Claude Code (.mcp.json in project root):

{
  "mcpServers": {
    "ref": {
      "command": "ref",
      "args": ["mcp"]
    }
  }
}

Claude Desktop (~/Library/Application Support/Claude/claude_desktop_config.json):

{
  "mcpServers": {
    "ref": {
      "command": "/usr/local/bin/ref",
      "args": ["mcp"]
    }
  }
}

See docs/mcp-integration.md for full setup guide and tool reference.

AI orchestration

ref is designed to work with Asimov, a vendor-neutral orchestrator for AI coding CLIs (Claude Code, Gemini CLI, Codex CLI).

Asimov's freshness protocol forces agents to use ref fetch for all web content instead of relying on built-in search tools or training data:

{
  "rule": "MUST use ref fetch <url> for all web fetching. NEVER use WebSearch or WebFetch."
}

This ensures agents work with current, verified content rather than stale or hallucinated sources.

Install

From releases:

# macOS (Apple Silicon)
curl -L https://github.com/mollendorff-ai/ref/releases/latest/download/ref-aarch64-apple-darwin.tar.gz | tar xz
sudo mv ref /usr/local/bin/

# macOS (Intel)
curl -L https://github.com/mollendorff-ai/ref/releases/latest/download/ref-x86_64-apple-darwin.tar.gz | tar xz
sudo mv ref /usr/local/bin/

# Linux (x64)
curl -L https://github.com/mollendorff-ai/ref/releases/latest/download/ref-x86_64-unknown-linux-musl.tar.gz | tar xz
sudo mv ref /usr/local/bin/

# Linux (ARM64)
curl -L https://github.com/mollendorff-ai/ref/releases/latest/download/ref-aarch64-unknown-linux-musl.tar.gz | tar xz
sudo mv ref /usr/local/bin/
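
For scripted installs, the right archive can be picked from the platform. The Python sketch below maps the values returned by platform.system() and platform.machine() to the four archive names listed above. The function name release_url is hypothetical, and the machine strings ("arm64", "aarch64", "x86_64") are the values commonly reported on these platforms, so treat the table as an assumption to verify on your hosts.

```python
import platform

BASE = "https://github.com/mollendorff-ai/ref/releases/latest/download"

# (system, machine) pairs mapped to the target triples listed above.
TARGETS = {
    ("Darwin", "arm64"): "aarch64-apple-darwin",
    ("Darwin", "x86_64"): "x86_64-apple-darwin",
    ("Linux", "x86_64"): "x86_64-unknown-linux-musl",
    ("Linux", "aarch64"): "aarch64-unknown-linux-musl",
}


def release_url(system=None, machine=None):
    """Returns the download URL for the given or current platform."""
    system = system or platform.system()
    machine = machine or platform.machine()
    try:
        target = TARGETS[(system, machine)]
    except KeyError:
        raise RuntimeError(
            f"no prebuilt binary listed for {system}/{machine}"
        ) from None
    return f"{BASE}/ref-{target}.tar.gz"
```

Platforms outside this table raise an error, in which case installing from crates.io is the remaining option.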

From crates.io:

cargo install mollendorff-ref

Requirements

  • Chrome or Chromium (for fetch, check-links, verify-refs)
  • Rust toolchain (build from source only)
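
A quick preflight check for the browser requirement can look like the Python sketch below. The executable names are common ones on Linux and are an assumption. How ref itself locates Chrome is not documented here, and on macOS the browser usually lives inside an application bundle rather than on PATH, so a miss from this check is not conclusive.

```python
import shutil

# Common executable names; the set ref actually probes is not documented here.
CANDIDATES = (
    "google-chrome",
    "google-chrome-stable",
    "chromium",
    "chromium-browser",
    "chrome",
)


def find_browser(which=shutil.which):
    """Returns the path of the first candidate found on PATH, or None."""
    for name in CANDIDATES:
        path = which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    print(find_browser() or "no Chrome/Chromium executable found on PATH")
```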

License

MIT