webpage-info
A fast, safe Rust library for extracting metadata from web pages. Parses HTML to extract titles, descriptions, OpenGraph data, Schema.org JSON-LD, links, and more.
Features
- Fast HTML parsing with cached CSS selectors (~92 MiB/s throughput)
- OpenGraph metadata extraction (og:title, og:image, etc.)
- Schema.org JSON-LD structured data parsing
- Link extraction with URL resolution
- Text content extraction excluding scripts and styles
- Async HTTP fetching with security protections
- SSRF protection blocks requests to private IPs by default
- Resource limits prevent memory exhaustion attacks
Installation
[dependencies]
webpage-info = "1.0"
For HTML parsing only (no HTTP client):
[dependencies]
webpage-info = { version = "1.0", default-features = false }
Quick Start
Fetch and parse a URL
use webpage_info::WebpageInfo;
async
Parse HTML directly
use webpage_info::HtmlInfo;
let html = r#"
<html>
<head>
<title>My Page</title>
<meta property="og:title" content="OpenGraph Title">
</head>
<body>
<a href="/about">About</a>
</body>
</html>
"#;
let info = from_string?;
assert_eq!;
assert_eq!;
assert_eq!;
Custom HTTP options
use ;
use std::time::Duration;
let options = new
.timeout
.max_body_size // 5 MB
.user_agent;
let info = fetch_with_options.await?;
Extracted Data
HtmlInfo
| Field | Type | Description |
|---|---|---|
| title | Option<String> | Document title from <title> tag |
| description | Option<String> | Meta description |
| language | Option<String> | Language from <html lang="..."> |
| canonical_url | Option<String> | Canonical URL from <link rel="canonical"> |
| feed_url | Option<String> | RSS/Atom feed URL |
| text_content | String | Extracted text (scripts/styles excluded) |
| meta | HashMap<String, String> | All meta tags |
| opengraph | Opengraph | OpenGraph metadata |
| schema_org | Vec<SchemaOrg> | Schema.org JSON-LD data |
| links | Vec<Link> | All links in the document |
OpenGraph
let og = &info.opengraph;
println!; // "article", "website", etc.
println!;
println!;
println!;
println!; // Vec<OpengraphMedia>
println!;
Schema.org
for schema in &info.schema_org
Security
SSRF Protection
By default, requests to private/internal IP addresses are blocked:
- Localhost (127.0.0.1, ::1)
- Private networks (10.x, 172.16-31.x, 192.168.x)
- Link-local (169.254.x - includes cloud metadata endpoints)
- Internal hostnames (.local, .internal)
To disable (not recommended for user-supplied URLs):
let options = new.block_private_ips;
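The blocked ranges listed above can be illustrated with a standard-library-only sketch. This is not the crate's implementation: `is_blocked_ip` and `is_blocked_host` are hypothetical names, and treating the literal hostname `localhost` as blocked is an assumption that goes beyond the list above.

```rust
use std::net::IpAddr;

/// Returns true if `ip` is loopback, private, or link-local.
/// Hypothetical helper for illustration; not part of webpage-info.
pub fn is_blocked_ip(ip: IpAddr) -> bool {
    match ip {
        // 127.0.0.0/8, then 10/8 + 172.16/12 + 192.168/16, then 169.254/16
        IpAddr::V4(v4) => v4.is_loopback() || v4.is_private() || v4.is_link_local(),
        // Only ::1 is checked here; a fuller check would cover more IPv6 ranges.
        IpAddr::V6(v6) => v6.is_loopback(),
    }
}

/// Returns true for IP literals in a blocked range and for hostnames that
/// look internal. Expects an unbracketed host (`::1`, not `[::1]`).
pub fn is_blocked_host(host: &str) -> bool {
    let host = host.trim_end_matches('.').to_ascii_lowercase();
    if let Ok(ip) = host.parse::<IpAddr>() {
        return is_blocked_ip(ip);
    }
    // Assumption: "localhost" is treated like the .local/.internal suffixes.
    host == "localhost" || host.ends_with(".local") || host.ends_with(".internal")
}
```

A hostname check like this is only a first filter. A public-looking name can still resolve to a private address, so a robust check would also need to inspect the resolved IP before connecting.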
Resource Limits
Default limits prevent resource exhaustion:
| Limit | Default | Option |
|---|---|---|
| Response body | 10 MB | max_body_size() |
| Links | 10,000 | - |
| Schema.org items | 100 | - |
| Text content | 1 MB | - |
| OpenGraph media | 100 each | - |
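As a rough illustration of how a response-body limit can be enforced, the sketch below caps a read with the standard library's `Read::take`. It is not taken from the crate: `read_capped` is a hypothetical helper, and rejecting oversized bodies (rather than truncating them) is an assumption.

```rust
use std::io::{self, Read};

/// Reads at most `max_bytes` from `reader`.
/// Returns an error if the source holds more than `max_bytes`.
/// Hypothetical helper for illustration; not part of webpage-info.
pub fn read_capped<R: Read>(reader: R, max_bytes: u64) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    // Read one byte past the limit so an oversized body can be detected
    // without buffering the whole thing.
    reader
        .take(max_bytes.saturating_add(1))
        .read_to_end(&mut buf)?;
    if buf.len() as u64 > max_bytes {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "response body exceeds limit",
        ));
    }
    Ok(buf)
}
```

Because `take` stops pulling from the underlying reader once the cap is reached, memory use stays bounded by `max_bytes + 1` regardless of how much data the peer sends.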
Performance
Benchmarks on a sample HTML document (9 KB):
| Operation | Time | Throughput |
|---|---|---|
| Full parse | ~96 µs | 92 MiB/s |
| 1000 links | ~725 µs | 1.4M links/s |
| Text extraction | ~59 µs | - |
| Schema.org (complex) | ~6 µs | - |
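The throughput column follows from the timings. The sketch below redoes the arithmetic, assuming "9 KB" means 9 × 1024 = 9216 bytes; the function names are illustrative only.

```rust
/// Converts a byte count and an elapsed time in microseconds into MiB/s.
pub fn mib_per_sec(bytes: f64, micros: f64) -> f64 {
    let seconds = micros * 1e-6;
    (bytes / seconds) / (1024.0 * 1024.0)
}

/// Converts an item count and an elapsed time in microseconds into items/s.
pub fn items_per_sec(items: f64, micros: f64) -> f64 {
    items / (micros * 1e-6)
}
```

With these, `mib_per_sec(9216.0, 96.0)` comes to about 91.6 MiB/s and `items_per_sec(1000.0, 725.0)` to about 1.38 million links per second, consistent with the rounded figures in the table.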
Run benchmarks:
cargo bench
Examples
# Fetch and display webpage info
License
MIT License - see LICENSE for details.