pageinfo-rs
CLI tool and library for researching web pages. Built to help LLMs inspect sites and build crawlers.
HTTP-only. No browser automation. Uses wreq with TLS fingerprinting via wreq-util for browser emulation.
What It Does
Fetches a page and exposes structural evidence:
- page identity and fetch result
- internal URL structure (groups, depth, sections)
- curated metadata
- feed-like URLs
- structured-data / embedded JSON signals (JSON-LD, Next.js data, inline JSON)
- extracted page text
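The "internal URL structure" item above can be pictured with a small standard-library sketch. It is not the crate's implementation: `path_depth`, `section`, and `group_by_section` are hypothetical helpers, and treating the first path segment as the "section" is an assumption made for illustration.

```rust
use std::collections::BTreeMap;

/// Depth of a URL path: the number of non-empty segments.
fn path_depth(path: &str) -> usize {
    path.split('/').filter(|s| !s.is_empty()).count()
}

/// Section of a URL path: its first non-empty segment, or "/" for the root.
/// (Assumption: a "section" is the first path segment.)
fn section(path: &str) -> &str {
    path.split('/').find(|s| !s.is_empty()).unwrap_or("/")
}

/// Group internal paths by section, keeping the original order inside each group.
fn group_by_section<'a>(paths: &[&'a str]) -> BTreeMap<&'a str, Vec<&'a str>> {
    let mut groups: BTreeMap<&'a str, Vec<&'a str>> = BTreeMap::new();
    for &path in paths {
        groups.entry(section(path)).or_default().push(path);
    }
    groups
}
```

Grouping by a cheap key like this is usually enough to show which parts of a site hold listing pages and which hold individual articles.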
Install
Binary name: pginf. Library crate: pageinfo_rs.
Library Usage
PageClient is the core HTTP client. Usable from any async Rust code:
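    use std::time::Duration;

    use pageinfo_rs::{Emulation, PageClient};

    // Argument values are illustrative; the exact builder signatures are assumed.
    let client = PageClient::builder()
        .proxy("socks5://user:pass@host:port")?
        .browser(Emulation::Chrome136)
        .timeout(Duration::from_secs(30))
        .build();

    // Fetch a page (example URL).
    let cached_page = client.fetch("https://example.com/").await?;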
Link Extraction
    use pageinfo_rs::extract_links;
    use scraper::Html;
    use url::Url;

    // `html` holds the page source; argument lists below are assumed.
    let doc = Html::parse_document(&html);
    let base = Url::parse("https://example.com/")?;

    // All links, normalized (lowercase host, no fragment)
    let links = extract_links(&doc, &base);

    // Internal links can be selected from processed links
    let internal: Vec<_> = links.iter().filter(|l| l.is_same_host(&base)).collect();

    // Manual normalization/tracking on individual links
    let mut link = links[0].clone();
    link.normalize();      // "https://example.com/page?utm_source=x"
    link.strip_tracking(); // "https://example.com/page"

    // Classification helpers
    link.is_asset();          // true for .css, .js, .png, .svg, .woff2, etc.
    link.is_same_host(&base); // exact host match, not registered domain
Also exported: extract_registered_domain, UrlFacts, DateKind.
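To make the normalization and tracking comments above concrete, here is a minimal standard-library sketch of the two steps. It is not the crate's code: the free functions `normalize` and `strip_tracking` and the `TRACKING_KEYS` list are hypothetical, and the set of parameters the crate treats as tracking is an assumption (`utm_*`, `fbclid`, `gclid` here).

```rust
/// Extra tracking keys beyond the `utm_` prefix (assumed list, for illustration).
const TRACKING_KEYS: [&str; 2] = ["fbclid", "gclid"];

/// Lowercase the scheme and host and drop the fragment of an absolute URL.
fn normalize(url: &str) -> String {
    // Everything after '#' is the fragment.
    let no_fragment = url.split('#').next().unwrap_or(url);
    let (scheme, rest) = match no_fragment.split_once("://") {
        Some(parts) => parts,
        None => return no_fragment.to_string(),
    };
    // The authority ends at the first '/' or '?'.
    let end = rest.find(|c: char| c == '/' || c == '?').unwrap_or(rest.len());
    let (authority, tail) = rest.split_at(end);
    // Keep any userinfo untouched; only the host is case-insensitive.
    let (userinfo, host) = match authority.rsplit_once('@') {
        Some((user, host)) => (format!("{user}@"), host),
        None => (String::new(), authority),
    };
    format!("{}://{}{}{}", scheme.to_lowercase(), userinfo, host.to_lowercase(), tail)
}

/// Remove tracking query parameters from an already-normalized URL.
fn strip_tracking(url: &str) -> String {
    let (base, query) = match url.split_once('?') {
        Some(parts) => parts,
        None => return url.to_string(),
    };
    let kept: Vec<&str> = query
        .split('&')
        .filter(|pair| !pair.is_empty())
        .filter(|pair| {
            let key = pair.split('=').next().unwrap_or("");
            !(key.starts_with("utm_") || TRACKING_KEYS.contains(&key))
        })
        .collect();
    if kept.is_empty() {
        base.to_string()
    } else {
        format!("{}?{}", base, kept.join("&"))
    }
}
```

Keeping the two steps separate mirrors the comments above: normalization is safe to apply everywhere, while stripping tracking parameters changes the query and is better left opt-in.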
pageinfo_rs re-exports Emulation, wreq, and wreq_util, so you do not need to add them as direct dependencies.
FetchResult includes fetch transparency fields: emulation_used, proxy_used (masked), attempts.
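As an illustration of what a masked proxy_used value might look like, the sketch below hides inline credentials in a proxy URL. The function name `mask_proxy` and the `***` placeholder are assumptions; the crate's actual masking format is not documented here.

```rust
/// Replace inline credentials in a proxy URL with "***".
/// URLs without credentials (or without a scheme) are returned unchanged.
fn mask_proxy(proxy: &str) -> String {
    let (scheme, rest) = match proxy.split_once("://") {
        Some(parts) => parts,
        None => return proxy.to_string(),
    };
    // The authority ends at the first '/', if there is one.
    let end = rest.find('/').unwrap_or(rest.len());
    let (authority, tail) = rest.split_at(end);
    // Split at the last '@' so passwords containing '@' are fully hidden.
    match authority.rsplit_once('@') {
        Some((_credentials, host)) => format!("{scheme}://***@{host}{tail}"),
        None => proxy.to_string(),
    }
}
```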
Features:
- Proxy support with inline auth (socks5://user:pass@host:port). Falls back to HTTPS_PROXY/HTTP_PROXY env vars.
- Browser emulation via wreq_util::Emulation: sets TLS fingerprint and headers. Available: Chrome 100–137, Firefox, Safari, Edge, OkHttp.
- Automatic fallback: on 403/429/503 or connection errors, retries with the next browser in the fallback chain. Default chain: Chrome 136, Firefox 139, Safari 18.5.
- Timeout: configurable, default 30 seconds.
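The automatic fallback described above can be sketched without any networking. This is an illustration under stated assumptions, not PageClient's code: the `Browser` enum, `ConnectionError`, `Outcome`, and `fetch_with_fallback` are hypothetical names, and the real HTTP request is replaced by a caller-supplied closure that returns a status code.

```rust
/// Stand-ins for the default fallback chain (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Browser {
    Chrome136,
    Firefox139,
    Safari18_5,
}

const DEFAULT_CHAIN: [Browser; 3] =
    [Browser::Chrome136, Browser::Firefox139, Browser::Safari18_5];

/// Placeholder for any connection-level failure.
#[derive(Debug, Clone, Copy, PartialEq)]
struct ConnectionError;

/// What the final attempt produced, plus transparency fields.
#[derive(Debug, PartialEq)]
struct Outcome {
    result: Result<u16, ConnectionError>,
    emulation_used: Browser,
    attempts: usize,
}

/// Retry on 403/429/503 or on a connection error.
fn is_retryable(result: &Result<u16, ConnectionError>) -> bool {
    matches!(result, Ok(403 | 429 | 503) | Err(_))
}

/// Try each browser in order and stop at the first non-retryable result.
/// Returns None only when the chain is empty.
fn fetch_with_fallback<F>(chain: &[Browser], mut attempt: F) -> Option<Outcome>
where
    F: FnMut(Browser) -> Result<u16, ConnectionError>,
{
    let mut last = None;
    for (index, &browser) in chain.iter().enumerate() {
        let result = attempt(browser);
        let retry = is_retryable(&result);
        last = Some(Outcome { result, emulation_used: browser, attempts: index + 1 });
        if !retry {
            break;
        }
    }
    last
}
```

When every browser in the chain is rejected, the sketch returns the last outcome rather than an error, so the caller can still see which emulation was tried last and how many attempts were made.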
CLI Commands
fetch
Fetch a page, cache it, and print HTTP metadata.
Cache flags: --refresh and --no-cache (described under Cache).
links
Show URL groups, path depth, and internal/external link structure.
meta
Show curated page metadata.
json
Show structured data signals such as JSON-LD and Next.js data.
text
Extract page text content.
html
Show HTML content, optionally filtered by CSS selector. Uses the same page cache as the analysis commands.
http
Low-level HTTP debug command. Shows request/response headers, body, and timing.
install
Install pginf skill files for AI coding agents.
help
Built-in documentation.
Global Flags
Apply to all commands that fetch pages:
| Flag | Description |
|---|---|
| `--proxy <URL>` | Proxy URL with optional inline auth |
| `--browser <NAME>` | Browser emulation: chrome137, firefox, safari, edge, okhttp |
| `--timeout <SECS>` | Request timeout in seconds |
For LLMs
An LLM tool skill is available at skills/pginf.md. Install it with the install command.
Cache
fetch, links, meta, json, text, and html cache fetched pages
locally in .pginf/. Stored data: fetch metadata, response headers, raw HTML.
Cache behavior:
- default: read cache on hit, fetch on miss, store result
- --refresh: refetch and overwrite cache entry
- --no-cache: skip cache read and write
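The three cache behaviors above amount to a small decision rule, sketched here with an in-memory map standing in for the .pginf/ directory. `CacheMode` and `get_page` are hypothetical names, and the real cache stores fetch metadata and headers as well as HTML; this sketch stores only a body string.

```rust
use std::collections::HashMap;

/// The three behaviors: default, --refresh, --no-cache.
#[derive(Debug, Clone, Copy, PartialEq)]
enum CacheMode {
    Default,
    Refresh,
    NoCache,
}

/// Return the page body for `url`, consulting and updating `cache` per `mode`.
/// `fetch` stands in for the real HTTP request.
fn get_page<F>(
    cache: &mut HashMap<String, String>,
    mode: CacheMode,
    url: &str,
    mut fetch: F,
) -> String
where
    F: FnMut(&str) -> String,
{
    // Only the default mode reads from the cache.
    if mode == CacheMode::Default {
        if let Some(hit) = cache.get(url) {
            return hit.clone();
        }
    }
    let body = fetch(url);
    // Default and Refresh store the result; NoCache leaves the cache untouched.
    if mode != CacheMode::NoCache {
        cache.insert(url.to_string(), body.clone());
    }
    body
}
```

Note that in this sketch a --no-cache fetch does not disturb an existing entry, so a later default run still sees the previously cached page.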
Architecture
src/
client.rs PageClient — HTTP fetching, proxy, browser emulation, fallback
http_display.rs HTTP transaction types and formatting (for `http` command)
output.rs Shared `text|json|toon` rendering traits
skills.rs Embedded skill file + install logic (for `install` command)
analyzer.rs Page analysis: link extraction, URL grouping, metadata, text
cache/ File-based page cache (.pginf/)
html.rs Legacy page info extraction (used by `http` command)
help.rs Built-in help text
main.rs CLI entry point
All HTTP fetching flows through PageClient. No raw wreq::Client construction outside of client.rs.
License
GPL-3.0