# Möllendorff Ref
Renders web pages and PDFs into token-optimized JSON for LLM agents.
## The problem
LLM agents are bad at getting fresh web content:
- Stale training data — without explicit tooling, models default to their training corpus, returning confident answers from months-old snapshots. Even with web search enabled, models can weight stale training knowledge over fresh results.
- Bot protection — `curl`, `wget`, and built-in web tools get blocked (403/999) by most sites. When they do get through, they return raw HTML: 10,000-50,000 tokens of navigation, ads, and markup noise.
- SPA blindness — modern sites return empty HTML shells. The actual content loads via JavaScript after the initial response.
## What ref does
Chrome renders the page, waits for all network requests to settle (SPAs included), then ref extracts structured content and outputs compact JSON.
100 KB of rendered HTML becomes 1-5 KB of structured JSON.
```
┌──────────┐   ┌─────────────┐   ┌──────────┐   ┌───────────┐
│ Headless │ → │ networkIdle │ → │ Strip    │ → │ Compact   │
│ Chrome   │   │ wait (SPA)  │   │ nav/ads  │   │ JSON out  │
└──────────┘   └─────────────┘   └──────────┘   └───────────┘
```
## Output example
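A shape sketch only — the exact output schema is not reproduced here, so the field names below are assumptions inferred from the notes in this section (sections, status detection):

```shell
$ ref fetch https://example.com
{
  "url": "https://example.com",
  "title": "Example Domain",
  "status": "ok",
  "sections": [
    {"heading": "Example Domain", "content": "This domain is for use in illustrative examples in documents."}
  ]
}
```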
Extract a specific element with --selector:
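For example (assuming `--selector` takes a CSS selector — the selector syntax is not specified here):

```shell
$ ref fetch https://example.com --selector "main h1"
```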
Null fields and empty arrays are omitted. Sections are capped (200 char headings, 2,000 char content). Code blocks include detected language. Status detects paywalls, login walls, and dead links.
## PDF extraction
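A minimal invocation sketch (the filename is a placeholder):

```shell
$ ref pdf report.pdf
```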
Same JSON structure, plus table detection (whitespace column analysis, header inference, markdown output) and heading detection (numbered sections, Roman numerals, ALL CAPS, academic/legal formats).
Also accepts URLs — downloads and extracts in one step:
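For example (placeholder URL):

```shell
$ ref pdf https://example.com/report.pdf
```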
## Commands
| Command | Description |
|---|---|
| `ref fetch <url>` | Render page via Chrome, output structured JSON |
| `ref pdf <file\|url>` | Extract text and tables from PDFs |
| `ref scan <files>` | Find URLs in markdown, build references.yaml |
| `ref verify-refs <file>` | Check reference entries, update status |
| `ref check-links <file>` | Validate URL health (HTTP status codes) |
| `ref refresh-data --url <url>` | Extract live data (market sizes, stats) |
| `ref init` | Create references.yaml template |
| `ref update` | Self-update from GitHub releases |
| `ref mcp` | Start MCP server (JSON-RPC 2.0 over stdio) |
## MCP Server Mode
`ref mcp` starts a persistent MCP server over stdio.
AI applications call tools directly — no shell spawning, and the browser pool stays warm between calls.
Six tools: `ref_fetch`, `ref_pdf`, `ref_check_links`, `ref_scan`, `ref_verify_refs`, `ref_refresh_data`.
Claude Code (.mcp.json in project root):
Claude Desktop (~/Library/Application Support/Claude/claude_desktop_config.json):
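Both files use the same server-config shape. A sketch, assuming the `ref` binary is on your PATH — the keys follow the common `mcpServers` convention and are not copied from this project's docs:

```json
{
  "mcpServers": {
    "ref": {
      "command": "ref",
      "args": ["mcp"]
    }
  }
}
```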
See docs/mcp-integration.md for full setup guide and tool reference.
## AI orchestration
ref is designed to work with Asimov, a vendor-neutral orchestrator for AI coding CLIs (Claude Code, Gemini CLI, Codex CLI).
Asimov's freshness protocol forces agents to use `ref fetch` for all web content instead of relying on built-in search tools or training data.
This ensures agents work with current, verified content rather than stale or hallucinated sources.
## Install
From releases — prebuilt binaries are published for:

- macOS (Apple Silicon)
- macOS (Intel)
- Linux (x64)
- Linux (ARM64)
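A download sketch following the common GitHub-releases layout — `<owner>`, the asset name, and the install path are placeholders, not real paths from this project:

```shell
# Hypothetical URL: substitute the real owner and asset name from the releases page
curl -fsSL -o ref \
  "https://github.com/<owner>/ref/releases/latest/download/ref-aarch64-apple-darwin"
chmod +x ref && mv ref ~/.local/bin/
```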
From crates.io:
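Assuming the crate is published under the name `ref` (unverified — check crates.io for the actual crate name):

```shell
cargo install ref   # crate name is an assumption
```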
## Requirements

- Chrome or Chromium (for fetch, check-links, verify-refs)
- Rust toolchain (only when building from source)