pageinfo-rs
CLI tool and library for researching web pages. Built to help LLMs inspect sites and build crawlers.
HTTP-only. No browser automation. Uses wreq with TLS fingerprinting via wreq-util for browser emulation.
What It Does
Fetches a page and exposes structural evidence:
- page identity and fetch result
- internal URL structure (groups, depth, sections)
- curated metadata
- feed-like URLs
- structured-data / embedded JSON signals (JSON-LD, Next.js data, inline JSON)
- extracted page text
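To make "groups, depth, sections" concrete, here is a minimal sketch of how URL paths could be bucketed. It is not taken from the pageinfo-rs source: the function names `path_segments`, `depth`, and `group_by_section` are hypothetical, and it assumes a "section" is simply the first path segment.

```rust
use std::collections::BTreeMap;

/// Returns the non-empty segments of a URL path such as "/blog/2024/post".
/// (Hypothetical helper, not part of pageinfo_rs.)
fn path_segments(path: &str) -> Vec<&str> {
    // Drop any query string or fragment before splitting on '/'.
    let path = path
        .split(|c: char| c == '?' || c == '#')
        .next()
        .unwrap_or("");
    path.split('/').filter(|s| !s.is_empty()).collect()
}

/// Depth is the number of path segments; the root path "/" has depth 0.
fn depth(path: &str) -> usize {
    path_segments(path).len()
}

/// Groups paths by their first segment (assumed here to be the "section").
/// Paths with no segments are grouped under "/".
fn group_by_section<'a>(paths: &[&'a str]) -> BTreeMap<String, Vec<&'a str>> {
    let mut groups: BTreeMap<String, Vec<&'a str>> = BTreeMap::new();
    for &p in paths {
        let section = path_segments(p)
            .first()
            .map(|s| s.to_string())
            .unwrap_or_else(|| "/".to_string());
        groups.entry(section).or_default().push(p);
    }
    groups
}
```

Grouping `/blog/a`, `/blog/b`, and `/about` this way yields one `blog` group with two entries and one `about` group with a single entry.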
Install
Binary name: pginf. Library crate: pageinfo_rs.
Library Usage
PageClient is the core HTTP client. Usable from any async Rust code:
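    use pageinfo_rs::{Emulation, PageClient};

    let client = PageClient::builder()
        .proxy(/* proxy URL */)?
        .browser(/* Emulation variant */)
        .timeout(/* timeout */)
        .build();
    let cached_page = client.fetch(/* URL */).await?;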
pageinfo_rs re-exports Emulation, wreq, and wreq_util, so no extra direct dependencies are needed.
Features:
- Proxy support with inline auth (`socks5://user:pass@host:port`). Falls back to `HTTPS_PROXY`/`HTTP_PROXY` env vars.
- Browser emulation via `wreq_util::Emulation`: sets TLS fingerprint and headers. Available: Chrome 100–137, Firefox, Safari, Edge, OkHttp.
- Automatic fallback: on 403/429/503 or connection errors, retries with the next browser in the fallback chain. Default chain: Chrome 136, Firefox 139, Safari 18.5.
- Timeout: configurable, default 30 seconds.
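The automatic fallback described above can be sketched as a loop over the chain. This is an illustration only, not the actual `PageClient` implementation: `Attempt`, `should_retry`, `fetch_with_fallback`, and the string browser names are hypothetical stand-ins (the real client uses `wreq_util::Emulation` values), and the network call is replaced by a closure.

```rust
/// Browser names standing in for the default chain (placeholders, not real identifiers).
const DEFAULT_CHAIN: [&str; 3] = ["chrome136", "firefox139", "safari18.5"];

/// Outcome of one simulated fetch attempt.
#[derive(Debug, PartialEq)]
enum Attempt {
    Status(u16),
    ConnectionError,
}

/// True for the conditions listed above: 403, 429, 503, or a connection error.
fn should_retry(attempt: &Attempt) -> bool {
    matches!(
        attempt,
        Attempt::ConnectionError | Attempt::Status(403 | 429 | 503)
    )
}

/// Tries each browser in order and returns the first browser whose attempt
/// is not retryable, together with its status code.
/// Returns `None` if every browser in the chain hit a retryable condition.
fn fetch_with_fallback<'a>(
    chain: &[&'a str],
    mut fetch: impl FnMut(&str) -> Attempt,
) -> Option<(&'a str, u16)> {
    for &browser in chain {
        let attempt = fetch(browser);
        if should_retry(&attempt) {
            continue; // move on to the next browser in the chain
        }
        if let Attempt::Status(code) = attempt {
            return Some((browser, code));
        }
    }
    None
}
```

With a fetch closure that answers 403 for `chrome136` and 200 for everything else, the function returns `Some(("firefox139", 200))`. A 404 is not in the retry set, so it would be returned as-is from the first browser.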
CLI Commands
analyze
Main research command. Uses the local page cache by default.
Focused views:
Cache flags:
http
Low-level HTTP debug command. Shows request/response headers, body, and timing.
html
Shows HTML content, optionally filtered by CSS selector. Uses the same cache as analyze.
install
Install pginf skill files for AI coding agents.
help
Built-in documentation.
Global Flags
Apply to all commands that fetch pages:
| Flag | Description |
|---|---|
| `--proxy <URL>` | Proxy URL with optional inline auth |
| `--browser <NAME>` | Browser emulation: chrome137, firefox, safari, edge, okhttp |
| `--timeout <SECS>` | Request timeout in seconds |
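As an illustration of how `--proxy` could interact with the `HTTPS_PROXY`/`HTTP_PROXY` fallback, here is a small sketch. The function `resolve_proxy` is hypothetical, and the precedence between the two environment variables (HTTPS first) is an assumption, since the README does not state it.

```rust
/// Picks a proxy URL: the explicit value wins, then HTTPS_PROXY, then HTTP_PROXY.
/// The order of the two environment variables is an assumption.
/// `lookup` abstracts environment access so the logic can be exercised
/// without touching the real process environment.
fn resolve_proxy(
    explicit: Option<&str>,
    lookup: impl Fn(&str) -> Option<String>,
) -> Option<String> {
    explicit
        .map(str::to_owned)
        .or_else(|| lookup("HTTPS_PROXY"))
        .or_else(|| lookup("HTTP_PROXY"))
}
```

In a real program the lookup would typically be `|key| std::env::var(key).ok()`.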
For LLMs
An LLM tool skill is available at skills/pginf.md. Install it with:
Cache
analyze and html cache fetched pages locally in .pginf/. Stored data: fetch metadata, response headers, raw HTML.
Cache behavior:
- default: read cache on hit, fetch on miss, store result
- `--refresh`: refetch and overwrite cache entry
- `--no-cache`: skip cache read and write
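The three cache behaviors can be summarized in a short sketch. It uses an in-memory `HashMap` as a stand-in for the on-disk `.pginf/` cache, and `CacheMode` and `get_page` are hypothetical names chosen for illustration, not the crate's actual types.

```rust
use std::collections::HashMap;

/// The three behaviors listed above (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq)]
enum CacheMode {
    Default, // read cache on hit, fetch on miss, store result
    Refresh, // --refresh: always fetch, overwrite the entry
    NoCache, // --no-cache: never read or write the cache
}

/// Returns the page body for `url`, consulting `cache` according to `mode`.
/// `fetch` stands in for the real network request.
fn get_page(
    cache: &mut HashMap<String, String>,
    mode: CacheMode,
    url: &str,
    mut fetch: impl FnMut(&str) -> String,
) -> String {
    match mode {
        CacheMode::Default => {
            if let Some(hit) = cache.get(url) {
                return hit.clone(); // cache hit: no fetch
            }
            let body = fetch(url);
            cache.insert(url.to_string(), body.clone());
            body
        }
        CacheMode::Refresh => {
            let body = fetch(url);
            cache.insert(url.to_string(), body.clone()); // overwrite
            body
        }
        CacheMode::NoCache => fetch(url), // cache left untouched
    }
}
```

Note that in this sketch `NoCache` leaves any existing entry in place, so a later default-mode call would still see the older cached copy.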
Architecture
src/
  client.rs        PageClient: HTTP fetching, proxy, browser emulation, fallback
  http_display.rs  HTTP transaction types and formatting (for `http` command)
  skills.rs        Embedded skill file + install logic (for `install` command)
  analyzer/        Page analysis: link extraction, URL grouping, metadata, structured data
  cache/           File-based page cache (.pginf/)
  html.rs          Legacy page info extraction (used by `http` command)
  help.rs          Built-in help text
  main.rs          CLI entry point
All HTTP fetching flows through PageClient. No raw wreq::Client construction outside of client.rs.
License
GPL-3.0