kreuzcrawl – A Rust crawling engine for turning websites into structured data.
Structs
- ArticleMetadata - Article metadata extracted from article:* Open Graph tags.
- BatchCrawlResult - Result from a single URL in a batch crawl operation.
- BatchScrapeResult - Result from a single URL in a batch scrape operation.
- BrowserConfig - Browser fallback configuration.
- CitationReference
- CitationResult - Result of citation conversion.
- ContentConfig - Content extraction and conversion configuration.
- CookieInfo - Information about an HTTP cookie received from a response.
- CrawlConfig - Configuration for crawl, scrape, and map operations.
- CrawlEngineHandle - Opaque handle to a configured crawl engine.
- CrawlPageResult - The result of crawling a single page during a crawl operation.
- CrawlResult - The result of a multi-page crawl operation.
- DownloadedAsset - A downloaded asset from a page.
- DownloadedDocument - A downloaded non-HTML document (PDF, DOCX, image, code file, etc.).
- ExtractionMeta - Metadata about an LLM extraction pass.
- FaviconInfo - Information about a favicon or icon link.
- FeedInfo - Information about a feed link found on a page.
- HeadingInfo - A heading element extracted from the page.
- HreflangEntry - An hreflang alternate link entry.
- ImageInfo - Information about an image found on a page.
- JsonLdEntry - A JSON-LD structured data entry found on a page.
- LinkInfo - Information about a link found on a page.
- MapResult - The result of a map operation, containing discovered URLs.
- MarkdownResult - Rich markdown conversion result from HTML processing.
- PageMetadata - Metadata extracted from an HTML page’s <meta> tags and <title> element.
- ProxyConfig - Proxy configuration for HTTP requests.
- ResponseMeta - Response metadata extracted from HTTP headers.
- ScrapeResult - The result of a single-page scrape operation.
- SitemapUrl - A URL entry from a sitemap.
Enums
- AssetCategory - The category of a downloaded asset.
- AuthConfig - Authentication configuration.
- BrowserMode - When to use the headless browser fallback.
- BrowserWait - Wait strategy for browser page rendering.
- CrawlError - Errors that can occur during crawling, scraping, or mapping operations.
- FeedType - The type of a feed (RSS, Atom, or JSON Feed).
- ImageSource - The source of an image reference.
- LinkType - The classification of a link.
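To illustrate the three feed kinds that FeedType distinguishes, here is a minimal, self-contained sketch that classifies a feed by its MIME type. It does not use kreuzcrawl itself: the enum `SketchFeedType`, its variant names, and the function `feed_type_from_mime` are hypothetical stand-ins, and the crate's real definitions may differ.

```rust
/// Hypothetical stand-in for a feed-type enum; the variant names are assumptions,
/// not the crate's actual definition.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum SketchFeedType {
    Rss,
    Atom,
    JsonFeed,
}

/// Map a MIME type (for example, the `type` attribute of a
/// `<link rel="alternate">` tag) to a feed type.
fn feed_type_from_mime(mime: &str) -> Option<SketchFeedType> {
    // Drop parameters such as "; charset=utf-8" and normalize case.
    let essence = mime
        .split(';')
        .next()
        .unwrap_or("")
        .trim()
        .to_ascii_lowercase();

    match essence.as_str() {
        "application/rss+xml" => Some(SketchFeedType::Rss),
        "application/atom+xml" => Some(SketchFeedType::Atom),
        "application/feed+json" => Some(SketchFeedType::JsonFeed),
        _ => None,
    }
}

fn main() {
    let kind = feed_type_from_mime("application/atom+xml; charset=utf-8");
    println!("{:?}", kind);
}
```

Returning `Option` keeps unrelated alternate links (such as `text/html`) from being misreported as feeds.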
Functions
- batch_crawl - Crawl multiple seed URLs concurrently, each following links to configured depth.
- batch_scrape - Scrape multiple URLs concurrently.
- crawl - Crawl a website starting from url, following links up to the configured depth.
- create_engine - Create a new crawl engine with the given configuration.
- map_urls - Discover all pages on a website by following links and sitemaps.
- scrape - Scrape a single URL, returning extracted page data.
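The idea of following links up to a configured depth can be sketched as a depth-limited breadth-first traversal. The example below is only an illustration over an in-memory link graph: it performs no network requests and does not call kreuzcrawl, whose function signatures are not shown on this page. The names `crawl_sketch`, `links`, and `max_depth` are hypothetical.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Breadth-first traversal of an in-memory link graph, visiting pages at most
/// `max_depth` hops away from `seed`. Returns each visited URL with its depth.
/// Hypothetical helper for illustration; not part of any crate's API.
fn crawl_sketch(
    links: &HashMap<&str, Vec<&str>>,
    seed: &str,
    max_depth: usize,
) -> Vec<(String, usize)> {
    let mut visited: HashSet<String> = HashSet::new();
    let mut queue: VecDeque<(String, usize)> = VecDeque::new();
    let mut pages = Vec::new();

    visited.insert(seed.to_string());
    queue.push_back((seed.to_string(), 0));

    while let Some((url, depth)) = queue.pop_front() {
        pages.push((url.clone(), depth));

        // Do not follow links out of pages that sit at the depth limit.
        if depth == max_depth {
            continue;
        }

        if let Some(outgoing) = links.get(url.as_str()) {
            for next in outgoing {
                // `insert` returns false for URLs already seen, which
                // prevents revisiting pages and looping on cycles.
                if visited.insert(next.to_string()) {
                    queue.push_back((next.to_string(), depth + 1));
                }
            }
        }
    }

    pages
}

fn main() {
    // A tiny mock site: each key links to the URLs in its value.
    let mut links = HashMap::new();
    links.insert(
        "https://example.com/",
        vec!["https://example.com/a", "https://example.com/b"],
    );
    links.insert(
        "https://example.com/a",
        vec!["https://example.com/c", "https://example.com/"],
    );
    links.insert("https://example.com/c", vec!["https://example.com/d"]);

    for (url, depth) in crawl_sketch(&links, "https://example.com/", 1) {
        println!("depth {depth}: {url}");
    }
}
```

With a depth limit of 1, the sketch visits the seed page and the two pages it links to directly, and stops before reaching the pages linked from those.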