servo-fetch 0.12.1

Fetch, render, and extract web content as Markdown, JSON, or screenshots with an embedded Servo browser engine. No Chromium required.
Documentation

servo-fetch

crates.io docs.rs

Fetch, render, and extract web content as Markdown, JSON, or screenshots with an embedded Servo browser engine. No Chromium, no containers, no external processes.

Looking for the CLI? See servo-fetch-cli.

Features

  • Real JS execution — SpiderMonkey runs JavaScript, parallel CSS engine computes layout
  • Layout- and visibility-aware extraction — strips navbars/footers by rendered position, plus cookie banners, modals, and CSS-hidden content
  • Schema-driven JSON — declarative CSS-selector schema pulls structured data, no LLM
  • Async-first API — top-level functions are async; sync mirror in [blocking] submodule
  • PDF auto-detection — URLs returning PDF are automatically extracted as text
  • Typed errorsError::Timeout, Error::InvalidUrl, etc. for match-based retry logic
  • SSRF protection — blocks private IPs, reserved ranges, and metadata endpoints

Quick Start

let md = servo_fetch::markdown("https://example.com").await?;

For synchronous code, use the [blocking] submodule:

let md = servo_fetch::blocking::markdown("https://example.com")?;

Examples

Fetch with options

use servo_fetch::{fetch, load_cookies, FetchOptions};
use std::time::Duration;

let page = fetch(
    &FetchOptions::new("https://spa-site.com")
        .timeout(Duration::from_secs(60))
        .settle(Duration::from_millis(3000))
        .user_agent("MyBot/1.0")
        .cookies(load_cookies("cookies.txt")?),
).await?;
println!("{}", page.html);
let md = page.markdown()?;

Screenshot

use servo_fetch::{fetch, FetchOptions};

let page = fetch(&FetchOptions::screenshot("https://example.com", true)).await?;
std::fs::write("page.png", page.screenshot_png().unwrap())?;

JavaScript execution

use servo_fetch::{fetch, FetchOptions};

let page = fetch(&FetchOptions::javascript("https://example.com", "document.title")).await?;
println!("{}", page.js_result.unwrap());

Crawl a site

use servo_fetch::{crawl_each, CrawlOptions};

crawl_each(
    &CrawlOptions::new("https://docs.example.com")
        .limit(100)
        .include(&["/docs/**"]),
    |result| match &result.outcome {
        Ok(page) => println!("{}: {} chars", result.url, page.content.len()),
        Err(e) => eprintln!("{}: {e}", result.url),
    },
).await?;

Schema-driven JSON extraction

use servo_fetch::{fetch, FetchOptions};
use servo_fetch::schema::ExtractSchema;

// Load a schema from a file...
let product_schema = ExtractSchema::from_path("schema.json")?;

// ...or from an inline string.
let product_schema = ExtractSchema::from_json(r#"{
    "base_selector": ".product",
    "fields": [
        { "name": "title", "selector": "h2", "type": "text" },
        { "name": "price", "selector": ".price", "type": "text" }
    ]
}"#)?;

let page = fetch(&FetchOptions::new("https://shop.example.com").schema(product_schema)).await?;
if let Some(value) = &page.extracted {
    println!("{}", serde_json::to_string_pretty(value)?);
}

Field type values: text, attribute, html, inner_html, nested_list. An empty selector ("") reads from the matched element itself — useful inside nested_list to grab each item's own text or attribute. For programmatic construction, see [ExtractSchema::builder()].

Error handling

use servo_fetch::{fetch, FetchOptions, Error};

match fetch(&FetchOptions::new(url)).await {
    Ok(page) => { /* ... */ }
    Err(Error::Timeout { .. }) => { /* retry */ }
    Err(Error::AddressNotAllowed { .. }) => { /* skip */ }
    Err(e) => return Err(e.into()),
}

Client

use servo_fetch::Client;
use std::time::Duration;

let client = Client::builder()
    .timeout(Duration::from_secs(60))
    .user_agent("MyBot/1.0")
    .build();

let p1 = client.fetch("https://a.com").await?;
let p2 = client.fetch("https://b.com").await?;

Sync mode

use servo_fetch::blocking;

let md = blocking::markdown("https://example.com")?;
let client = blocking::Client::builder().timeout(std::time::Duration::from_secs(60)).build();
let page = client.fetch("https://example.com")?;

Environment Variables

Variable Description
SERVO_FETCH_USER_AGENT Default User-Agent string (overridden by .user_agent())

API Overview

Every function is available in both async (top-level) and sync (blocking::*) form.

Function Returns
markdown(url) Readable Markdown
text(url) Plain text (innerText)
extract_json(url) Structured JSON
fetch(opts) Page
crawl(opts) Vec<CrawlResult>
crawl_each(opts, cb) Streaming results via callback
map(opts) Vec<MappedUrl> (URL discovery)
Client / ClientBuilder Reusable client with defaults

See docs.rs for the full API reference and examples/ for complete runnable programs.

License

MIT OR Apache-2.0