# Möllendorff Ref
Renders web pages and PDFs into token-optimized JSON for LLM agents.
## The problem
LLM agents are bad at getting fresh web content:
- **Stale training data** — without explicit tooling, models default to their training corpus, returning confident answers from months-old snapshots.
Even with web search enabled, models can weight stale training knowledge over fresh results.
- **Bot protection** — `curl`, `wget`, and built-in web tools get blocked (403/999) by most sites.
When they do get through, they return raw HTML — 10,000-50,000 tokens of navigation, ads, and markup noise.
- **SPA blindness** — modern sites return empty HTML shells.
The actual content loads via JavaScript after the initial response.
## What ref does
Chrome renders the page, waits for all network requests to settle (SPAs included), then ref extracts structured content and outputs compact JSON.
**100 KB of rendered HTML becomes 1-5 KB of structured JSON.**
```
┌──────────┐   ┌─────────────┐   ┌──────────┐   ┌──────────┐
│ Headless │ → │ networkIdle │ → │ Strip    │ → │ Compact  │
│ Chrome   │   │ wait (SPA)  │   │ nav/ads  │   │ JSON out │
└──────────┘   └─────────────┘   └──────────┘   └──────────┘
```
### Output example
```json
{
  "url": "https://example.com",
  "status": "ok",
  "title": "Example Domain",
  "sections": [
    {
      "level": 1,
      "heading": "Example Domain",
      "content": "This domain is for use in illustrative examples..."
    }
  ],
  "links": [
    { "text": "More information...", "url": "https://www.iana.org/domains/example" }
  ],
  "chars": 1256
}
```
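
To illustrate how an agent might consume this output, here is a minimal Python sketch that flattens the JSON into prompt text. The sample is the document shown above; the `to_prompt` helper and its Markdown-style rendering are hypothetical and not part of ref.

```python
import json

# Sample output copied from the example above.
SAMPLE = """
{
  "url": "https://example.com",
  "status": "ok",
  "title": "Example Domain",
  "sections": [
    {"level": 1, "heading": "Example Domain",
     "content": "This domain is for use in illustrative examples..."}
  ],
  "links": [
    {"text": "More information...", "url": "https://www.iana.org/domains/example"}
  ],
  "chars": 1256
}
"""


def to_prompt(doc: dict) -> str:
    """Render a ref document as plain text (hypothetical helper, not part of ref)."""
    lines = [f"Source: {doc['url']} (status: {doc['status']})"]
    # Use .get() throughout: keys can be missing from the output entirely.
    for section in doc.get("sections", []):
        heading = section.get("heading")
        if heading:
            lines.append("#" * section.get("level", 1) + " " + heading)
        if section.get("content"):
            lines.append(section["content"])
    for link in doc.get("links", []):
        lines.append(f"- {link['text']}: {link['url']}")
    return "\n".join(lines)


doc = json.loads(SAMPLE)
prompt = to_prompt(doc)
print(prompt)
```

Reading keys defensively matters here because the output leaves out fields that have no value.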
Pass `--selector` to extract a specific element.

Null fields and empty arrays are omitted.
Sections are capped (200 char headings, 2,000 char content).
Code blocks include detected language.
Status detects paywalls, login walls, and dead links.
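
The omission and capping rules can be pictured with a short Python sketch. It is an illustration of the stated behavior only: ref's actual implementation is not shown in this README, and truncating by plain slicing is an assumption.

```python
MAX_HEADING = 200   # heading cap stated above
MAX_CONTENT = 2000  # content cap stated above


def compact(value):
    """Recursively drop None values and empty lists (illustrative only)."""
    if isinstance(value, dict):
        cleaned = {key: compact(item) for key, item in value.items()}
        return {k: v for k, v in cleaned.items() if v is not None and v != []}
    if isinstance(value, list):
        return [compact(item) for item in value]
    return value


def cap_section(section: dict) -> dict:
    """Truncate heading and content to the stated limits (slicing is assumed)."""
    capped = dict(section)
    if isinstance(capped.get("heading"), str):
        capped["heading"] = capped["heading"][:MAX_HEADING]
    if isinstance(capped.get("content"), str):
        capped["content"] = capped["content"][:MAX_CONTENT]
    return capped


# Made-up input with a null field, an empty array, and oversized text.
raw = {
    "url": "https://example.com",
    "title": None,
    "links": [],
    "sections": [{"level": 1, "heading": "H" * 300, "content": "c" * 5000}],
}
result = compact(raw)
result["sections"] = [cap_section(s) for s in result["sections"]]
print(sorted(result))
```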
### PDF extraction
`ref pdf` produces the same JSON structure, plus table detection (whitespace column analysis, header inference, markdown output) and heading detection (numbered sections, Roman numerals, ALL CAPS, academic/legal formats).
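
As a rough picture of what whitespace column analysis with markdown output can look like, the Python sketch below splits aligned text rows on runs of two or more spaces and treats the first row as the header. It is a deliberately simplified stand-in, not ref's algorithm, and the sample rows are invented.

```python
import re


def columns_to_markdown(lines):
    """Convert whitespace-aligned text rows into a Markdown table."""
    # Treat runs of two or more whitespace characters as column boundaries.
    rows = [re.split(r"\s{2,}", ln.strip()) for ln in lines if ln.strip()]
    width = max(len(row) for row in rows)
    # Pad short rows so every row has the same number of cells.
    rows = [row + [""] * (width - len(row)) for row in rows]
    header, body = rows[0], rows[1:]  # assumption: the first row is the header
    table = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    table += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(table)


# Invented sample rows, as they might appear in text extracted from a PDF.
text = [
    "Region      Revenue   Growth",
    "North       1,200     4.1%",
    "South       980       2.7%",
]
markdown = columns_to_markdown(text)
print(markdown)
```

A real detector would also need to decide which lines belong to a table at all, which this sketch does not attempt.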
`ref pdf` also accepts URLs, downloading and extracting in one step.
## Commands
| Command | Description |
|---|---|
| `ref fetch <url>` | Render page via Chrome, output structured JSON |
| `ref pdf <file\|url>` | Extract text and tables from PDFs |
| `ref scan <files>` | Find URLs in markdown, build references.yaml |
| `ref verify-refs <file>` | Check reference entries, update status |
| `ref check-links <file>` | Validate URL health (HTTP status codes) |
| `ref refresh-data --url <url>` | Extract live data (market sizes, stats) |
| `ref init` | Create references.yaml template |
| `ref update` | Self-update from GitHub releases |
| `ref mcp` | Start MCP server (JSON-RPC 2.0 over stdio) |
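
For scripts that shell out to ref rather than using the MCP server, a thin wrapper along these lines may be enough. This is a sketch under assumptions: it assumes `ref fetch <url>` writes its JSON to stdout and exits non-zero on failure, neither of which is stated here. The `ref_fetch`, `is_usable`, and `fake_run` names are hypothetical.

```python
import json
import subprocess


def ref_fetch(url: str, run=subprocess.run) -> dict:
    """Run `ref fetch <url>` and parse its JSON output.

    Assumes the JSON document is written to stdout. `run` is injectable so
    the wrapper can be exercised without ref installed.
    """
    proc = run(["ref", "fetch", url], capture_output=True, text=True, check=True)
    return json.loads(proc.stdout)


def is_usable(doc: dict) -> bool:
    # Only "ok" appears in the output example; other status values are not listed.
    return doc.get("status") == "ok"


def fake_run(cmd, **kwargs):
    # Stand-in for subprocess.run so this example runs without ref or Chrome.
    payload = {"url": cmd[-1], "status": "ok", "title": "Example Domain"}
    return subprocess.CompletedProcess(cmd, 0, stdout=json.dumps(payload), stderr="")


page = ref_fetch("https://example.com", run=fake_run)
print(page["title"], is_usable(page))
```

Dropping the `run=fake_run` argument makes the same call invoke the real binary.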
## MCP Server Mode
`ref mcp` starts a persistent MCP server over stdio.
AI applications call tools directly: no shell spawning, and the browser pool stays warm between calls.
Six tools: `ref_fetch`, `ref_pdf`, `ref_check_links`, `ref_scan`, `ref_verify_refs`, `ref_refresh_data`.
**Claude Code** (`.mcp.json` in project root):
```json
{
  "mcpServers": {
    "ref": {
      "command": "ref",
      "args": ["mcp"]
    }
  }
}
```
**Claude Desktop** (`~/Library/Application Support/Claude/claude_desktop_config.json`):
```json
{
  "mcpServers": {
    "ref": {
      "command": "/usr/local/bin/ref",
      "args": ["mcp"]
    }
  }
}
```
See [docs/mcp-integration.md](docs/mcp-integration.md) for full setup guide and tool reference.
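
For readers curious what travels over stdio, the Python sketch below builds the JSON-RPC 2.0 messages a generic MCP client would send to call `ref_fetch`. The method names follow the MCP specification. The `url` argument name, the client name, and the protocol version string are assumptions; the linked tool reference has the real schema.

```python
import json


def rpc(method, params=None, msg_id=None):
    """Build one JSON-RPC 2.0 message; notifications carry no `id`."""
    msg = {"jsonrpc": "2.0", "method": method}
    if msg_id is not None:
        msg["id"] = msg_id
    if params is not None:
        msg["params"] = params
    return msg


messages = [
    # Handshake required before any tool call.
    rpc("initialize", {
        "protocolVersion": "2024-11-05",  # assumed; use what the server supports
        "capabilities": {},
        "clientInfo": {"name": "example-client", "version": "0.1.0"},
    }, msg_id=1),
    rpc("notifications/initialized"),
    # Tool invocation; the "url" argument name is an assumption.
    rpc("tools/call", {
        "name": "ref_fetch",
        "arguments": {"url": "https://example.com"},
    }, msg_id=2),
]

# The MCP stdio transport sends one JSON message per line.
wire = "".join(json.dumps(m) + "\n" for m in messages)
print(wire, end="")
```

In practice an MCP client library handles this framing; the sketch only shows the shape of the exchange.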
## AI orchestration
ref is designed to work with [Asimov](https://github.com/mollendorff-ai/asimov), a vendor-neutral orchestrator for AI coding CLIs (Claude Code, Gemini CLI, Codex CLI).
Asimov's `freshness` protocol forces agents to use `ref fetch` for all web content instead of relying on built-in search tools or training data:
```json
{
  "rule": "MUST use ref fetch <url> for all web fetching. NEVER use WebSearch or WebFetch."
}
```
This ensures agents work with current, verified content rather than stale or hallucinated sources.
## Install
From [releases](https://github.com/mollendorff-ai/ref/releases):
```bash
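# macOS (Apple Silicon): download the matching archive from the releases page, then extract and move it as in the commands below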
# macOS (Intel)
curl -L https://github.com/mollendorff-ai/ref/releases/latest/download/ref-x86_64-apple-darwin.tar.gz | tar xz
sudo mv ref /usr/local/bin/
# Linux (x64)
curl -L https://github.com/mollendorff-ai/ref/releases/latest/download/ref-x86_64-unknown-linux-musl.tar.gz | tar xz
sudo mv ref /usr/local/bin/
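# Linux (ARM64): download the matching archive from the releases page, then extract and move it as in the commands above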
```
From crates.io:
```bash
cargo install mollendorff-ref
```
## Requirements
- Chrome or Chromium (for fetch, check-links, verify-refs)
- Rust toolchain (build from source only)
## License
[MIT](LICENSE)