fetchkit
AI-friendly web content fetching tool designed for LLM consumption. Rust library with CLI, MCP server, and Python bindings.
Features
- HTTP fetching - GET and HEAD methods with streaming support
- HTML-to-Markdown - Built-in conversion optimized for LLMs
- HTML-to-Text - Plain text extraction with clean formatting
- Binary detection - Returns metadata only for images, PDFs, etc.
- Timeout handling - 1s first-byte, 30s body with partial content on timeout
- Safety limits - 10 MB default decompressed body cap with truncation
- URL filtering - URL-aware allow/block lists for controlled access
- SSRF protection - Resolve-then-check blocks private IPs by default
- MCP server - Model Context Protocol support for AI tool integration
Installation
From crates.io (recommended)
From Git
From Source
CLI Usage
# Fetch URL (outputs markdown with frontmatter)
# Output as JSON instead
# Custom user agent
# Hardened outbound policy for cluster/data-plane use
# Show full documentation
Default output is markdown with YAML frontmatter:
---
url: https://example.com
status_code: 200
source_content_type: text/html; charset=UTF-8
source_size: 1256
---
This domain is for use in illustrative examples in documents...
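A consumer of this output may want to separate the metadata from the page body. The following is a minimal sketch, assuming the frontmatter is delimited by `---` lines (the usual YAML frontmatter convention) and that each metadata line is a simple `key: value` pair. `split_frontmatter` is a hypothetical helper for illustration, not part of fetchkit.

```rust
/// Splits a document into (metadata pairs, body).
/// Assumption: the document starts with a `---` line and the
/// frontmatter ends at the next `---` line.
fn split_frontmatter(doc: &str) -> Option<(Vec<(String, String)>, String)> {
    let mut lines = doc.lines();
    if lines.next()?.trim() != "---" {
        return None;
    }
    let mut meta = Vec::new();
    let mut closed = false;
    for line in lines.by_ref() {
        if line.trim() == "---" {
            closed = true;
            break;
        }
        // Split on the first ": " only, so values such as URLs keep their colons.
        if let Some((key, value)) = line.split_once(": ") {
            meta.push((key.trim().to_string(), value.trim().to_string()));
        }
    }
    if !closed {
        return None;
    }
    // Everything after the closing delimiter is the page content.
    let body = lines.collect::<Vec<_>>().join("\n");
    Some((meta, body))
}
```

A real consumer would more likely hand the metadata block to a YAML parser. The manual split here only keeps the sketch dependency-free.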
JSON output (-o json):
MCP Server
Run as a Model Context Protocol server:
# Hardened profile for cluster/data-plane use
Exposes the fetchkit tool over a JSON-RPC 2.0 stdio transport. Returns markdown with frontmatter (the same format as the CLI). Compatible with Claude Desktop and other MCP clients.
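To illustrate what a client sends over that transport, here is a rough sketch that builds an MCP `tools/call` request as a single JSON line. The `tools/call` method and the `name`/`arguments` parameters come from the MCP specification. The argument name `url` is an assumption about the tool's input schema, and `tools_call_request` and `json_escape` are hypothetical helpers. The sketch omits the `initialize` handshake that a client performs first.

```rust
/// Escapes the characters that must be escaped inside a JSON string.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Builds a JSON-RPC 2.0 `tools/call` request for the `fetchkit` tool.
/// Assumption: the tool accepts a `url` argument.
fn tools_call_request(id: u64, url: &str) -> String {
    format!(
        r#"{{"jsonrpc":"2.0","id":{},"method":"tools/call","params":{{"name":"fetchkit","arguments":{{"url":"{}"}}}}}}"#,
        id,
        json_escape(url)
    )
}
```

Writing the returned string followed by a newline to the server's stdin would deliver one request. A real client would normally use a JSON library instead of manual formatting.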
Library Usage
Add to Cargo.toml:
[dependencies]
fetchkit = "0.2"
Basic Fetch
use ;
async
With Tool Builder
use ;
let tool = new
.enable_markdown
.enable_text
.user_agent
.allow_prefix
.block_prefix
.build;
let request = new;
let response = tool.execute.await.unwrap;
Toolkit Contract Surface
use ToolBuilder;
let builder = new.enable_save_to_file;
let tool = builder.build;
assert_eq!;
assert_eq!;
let definition = builder.build_tool_definition;
let mut service = builder.build_service;
Hardened Tool Profile
use Tool;
let tool = builder
.hardened
.allow_prefix
.build;
Python Bindings
# Simple fetch
=
# With configuration
=
=
Response Fields
| Field | Type | Description |
|---|---|---|
| url | string | Fetched URL |
| status_code | int | HTTP status code |
| content_type | string? | Content-Type header |
| size | int? | Content size in bytes |
| last_modified | string? | Last-Modified header |
| filename | string? | From Content-Disposition |
| format | string? | "markdown", "text", "raw", or "github_repo" |
| content | string? | Page content |
| truncated | bool? | True if content was cut off |
| method | string? | "HEAD" for HEAD requests |
| error | string? | Error message if failed |
Error Handling
Errors are returned in the error field:
- InvalidUrl - Malformed URL
- UrlBlocked - URL blocked by filter
- NetworkError - Connection failed
- Timeout - Request timed out
- HttpError - 4xx/5xx response
- ContentError - Failed to read body
- BinaryContent - Binary content not supported
Security
FetchKit blocks connections to private/reserved IP ranges by default, preventing SSRF attacks when used in server-side or AI agent contexts.
Blocked by default: loopback, private networks (10.x, 172.16-31.x, 192.168.x), link-local (169.254.x including cloud metadata), IPv6 equivalents, multicast, and other reserved ranges.
// Default: private IPs blocked (safe for production)
let tool = Tool::default();
// Explicit opt-out for local development only
let tool = Tool::builder()
    .block_private_ips(false)
    .build();
DNS pinning prevents DNS rebinding attacks. IPv6-mapped IPv4 addresses are canonicalized before validation.
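The address check described above can be sketched with the standard library alone. This is an illustrative approximation, not FetchKit's actual implementation: `canonicalize` and `is_blocked` are hypothetical names, and the set of ranges covers only those named in this section.

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr};

/// Turns an IPv4-mapped IPv6 address (::ffff:a.b.c.d) into its IPv4 form
/// so that the IPv4 rules apply to it.
fn canonicalize(ip: IpAddr) -> IpAddr {
    match ip {
        IpAddr::V6(v6) => match v6.to_ipv4_mapped() {
            Some(v4) => IpAddr::V4(v4),
            None => IpAddr::V6(v6),
        },
        v4 => v4,
    }
}

fn is_blocked_v4(ip: Ipv4Addr) -> bool {
    ip.is_loopback()          // 127.0.0.0/8
        || ip.is_private()    // 10/8, 172.16/12, 192.168/16
        || ip.is_link_local() // 169.254/16, includes cloud metadata
        || ip.is_multicast()
        || ip.is_broadcast()
        || ip.is_unspecified()
}

fn is_blocked_v6(ip: Ipv6Addr) -> bool {
    let first = ip.segments()[0];
    ip.is_loopback()
        || ip.is_unspecified()
        || ip.is_multicast()
        || (first & 0xfe00) == 0xfc00 // unique local, fc00::/7
        || (first & 0xffc0) == 0xfe80 // link-local, fe80::/10
}

/// Returns true if a resolved address should be refused.
fn is_blocked(ip: IpAddr) -> bool {
    match canonicalize(ip) {
        IpAddr::V4(v4) => is_blocked_v4(v4),
        IpAddr::V6(v6) => is_blocked_v6(v6),
    }
}
```

In a resolve-then-check design, a check like `is_blocked` would run on every address returned by DNS, and the connection would then be made to the exact address that passed, so that a second lookup cannot return a different one.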
Redirects are followed manually in the default fetcher so each hop is revalidated against scheme and DNS policy. Allow/block prefixes are matched against parsed URLs rather than raw strings, which prevents lookalike host overmatches such as allowed.example.com.evil.test.
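The difference between raw-string and parsed-URL prefix matching can be shown with a small sketch. It uses a deliberately simplified, hand-written parser so that it needs no dependencies. `Parts`, `parse`, and `matches_prefix` are hypothetical names, and FetchKit's real matching rules may differ (for example in how ports and path boundaries are handled).

```rust
/// Minimal parsed form of a URL: scheme, lowercase host, and path.
#[derive(Debug, PartialEq)]
struct Parts {
    scheme: String,
    host: String,
    path: String,
}

/// Simplified parser: does not handle IPv6 literals and ignores ports.
fn parse(url: &str) -> Option<Parts> {
    let (scheme, rest) = url.split_once("://")?;
    // The authority ends at the first '/', '?' or '#'.
    let end = rest
        .find(|c: char| c == '/' || c == '?' || c == '#')
        .unwrap_or(rest.len());
    let (authority, tail) = rest.split_at(end);
    // Drop userinfo ("user:pass@") and the port.
    let host_port = authority.rsplit('@').next()?;
    let host = host_port.split(':').next()?;
    if host.is_empty() {
        return None;
    }
    // Keep only the path; drop query and fragment.
    let path = tail.split(|c: char| c == '?' || c == '#').next().unwrap_or("");
    Some(Parts {
        scheme: scheme.to_ascii_lowercase(),
        host: host.to_ascii_lowercase(),
        path: if path.is_empty() { "/".to_string() } else { path.to_string() },
    })
}

/// Scheme and host must match exactly; only the path is prefix-matched.
fn matches_prefix(url: &str, prefix: &str) -> bool {
    match (parse(url), parse(prefix)) {
        (Some(u), Some(p)) => {
            u.scheme == p.scheme && u.host == p.host && u.path.starts_with(&p.path)
        }
        _ => false,
    }
}
```

With the prefix `https://allowed.example.com`, a raw `starts_with` check would accept `https://allowed.example.com.evil.test/`, while `matches_prefix` rejects it because the hosts are compared as whole components.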
Proxy environment variables are ignored by default. Use the hardened profile for cluster-facing deployments and opt in with ToolBuilder::respect_proxy_env(true) only when it is part of an intentional egress design.
See specs/threat-model.md for the full threat model.
See docs/hardening.md for deployment guidance.
Configuration
Timeouts And Limits
- First-byte: 1 second (connect + initial response)
- Body: 30 seconds total
- Body size: 10 MB decompressed content by default
When the body timeout or the body-size limit is hit, the content read so far is returned with truncated: true.
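The size-limit half of this behaviour can be sketched as a capped read that reports whether anything was cut off. This is an assumption about the general technique, not FetchKit's code. `read_capped` is a hypothetical helper, and timeouts are left out because they depend on the async runtime in use.

```rust
use std::io::{self, Read};

/// Reads at most `limit` bytes and reports whether the source had more.
fn read_capped<R: Read>(reader: R, limit: usize) -> io::Result<(Vec<u8>, bool)> {
    let mut body = Vec::new();
    // Read up to limit + 1 bytes: the extra byte reveals whether more data existed.
    reader.take(limit as u64 + 1).read_to_end(&mut body)?;
    let truncated = body.len() > limit;
    body.truncate(limit);
    Ok((body, truncated))
}
```

With the default described above, `limit` would be 10 MB (`10 * 1024 * 1024`), applied to the decompressed stream.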
Binary Content
Binary content is detected automatically, and only metadata is returned for:
- Images, audio, video, fonts
- PDFs, archives (zip, tar, rar, 7z)
- Office documents
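One plausible way to classify the categories listed above is by Content-Type. The sketch below assumes header-based detection only. How FetchKit actually detects binary content (for example, whether it also inspects bytes or file extensions) is not stated here, and `is_binary_content_type` is a hypothetical helper with an illustrative, incomplete list of MIME types.

```rust
/// Returns true if the Content-Type suggests binary content.
fn is_binary_content_type(content_type: &str) -> bool {
    // Drop parameters such as "; charset=UTF-8" and normalise case.
    let mime = content_type
        .split(';')
        .next()
        .unwrap_or("")
        .trim()
        .to_ascii_lowercase();

    // Whole top-level types: images, audio, video, fonts.
    const TYPE_PREFIXES: [&str; 4] = ["image/", "audio/", "video/", "font/"];
    // PDFs, archives, and the legacy Word type.
    const EXACT: [&str; 6] = [
        "application/pdf",
        "application/zip",
        "application/x-tar",
        "application/vnd.rar",
        "application/x-7z-compressed",
        "application/msword",
    ];
    // Office document families.
    const OFFICE_PREFIXES: [&str; 2] = [
        "application/vnd.openxmlformats-officedocument.",
        "application/vnd.ms-",
    ];

    TYPE_PREFIXES.iter().any(|p| mime.starts_with(p))
        || EXACT.contains(&mime.as_str())
        || OFFICE_PREFIXES.iter().any(|p| mime.starts_with(p))
}
```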
HTML Conversion
HTML is automatically converted to markdown:
- Headers: h1-h6 → # to ######
- Lists: Proper nesting with 2-space indent
- Code: Fenced blocks and inline backticks
- Links: [text](url) format
- Strips: scripts, styles, iframes, SVGs
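The output shapes in the list above can be illustrated with a few formatting helpers. These are hypothetical functions showing only the target markdown for headers, nested list items, and links. They are not FetchKit's converter, and the `-` bullet marker is an assumption.

```rust
/// h1-h6 map to one to six '#' characters.
fn heading(level: u8, text: &str) -> String {
    let level = level.clamp(1, 6) as usize;
    format!("{} {}", "#".repeat(level), text.trim())
}

/// Nested list items are indented by two spaces per nesting level.
fn list_item(depth: usize, text: &str) -> String {
    format!("{}- {}", "  ".repeat(depth), text)
}

/// Anchors become inline markdown links.
fn link(text: &str, url: &str) -> String {
    format!("[{}]({})", text, url)
}
```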
License
MIT