Crate kreuzcrawl

kreuzcrawl – A Rust crawling engine for turning websites into structured data.

Structs

ArticleMetadata
Article metadata extracted from article:* Open Graph tags.
BatchCrawlResult
Result from a single URL in a batch crawl operation.
BatchScrapeResult
Result from a single URL in a batch scrape operation.
BrowserConfig
Browser fallback configuration.
CitationReference
CitationResult
Result of citation conversion.
ContentConfig
Content extraction and conversion configuration.
CookieInfo
Information about an HTTP cookie received from a response.
CrawlConfig
Configuration for crawl, scrape, and map operations.
CrawlEngineHandle
Opaque handle to a configured crawl engine.
CrawlPageResult
The result of crawling a single page during a crawl operation.
CrawlResult
The result of a multi-page crawl operation.
DownloadedAsset
A downloaded asset from a page.
DownloadedDocument
A downloaded non-HTML document (PDF, DOCX, image, code file, etc.).
ExtractionMeta
Metadata about an LLM extraction pass.
FaviconInfo
Information about a favicon or icon link.
FeedInfo
Information about a feed link found on a page.
HeadingInfo
A heading element extracted from the page.
HreflangEntry
An hreflang alternate link entry.
ImageInfo
Information about an image found on a page.
JsonLdEntry
A JSON-LD structured data entry found on a page.
LinkInfo
Information about a link found on a page.
MapResult
The result of a map operation, containing discovered URLs.
MarkdownResult
Rich markdown conversion result from HTML processing.
PageMetadata
Metadata extracted from an HTML page’s <meta> tags and <title> element.
ProxyConfig
Proxy configuration for HTTP requests.
ResponseMeta
Response metadata extracted from HTTP headers.
ScrapeResult
The result of a single-page scrape operation.
SitemapUrl
A URL entry from a sitemap.

Enums

AssetCategory
The category of a downloaded asset.
AuthConfig
Authentication configuration.
BrowserMode
When to use the headless browser fallback.
BrowserWait
Wait strategy for browser page rendering.
CrawlError
Errors that can occur during crawling, scraping, or mapping operations.
FeedType
The type of a feed (RSS, Atom, or JSON Feed).
ImageSource
The source of an image reference.
LinkType
The classification of a link.
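
The FeedType entry above distinguishes RSS, Atom, and JSON Feed. As a rough illustration of how such a classification might be derived, the sketch below maps the MIME type of a feed link to a feed kind. `FeedKind` and `feed_kind_from_mime` are hypothetical names invented for this example; they are not kreuzcrawl's actual types, variants, or detection logic.

```rust
/// Hypothetical stand-in for a feed-type enum; not kreuzcrawl's definition.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum FeedKind {
    Rss,
    Atom,
    JsonFeed,
}

/// Map the `type` attribute of a `<link rel="alternate">` element to a feed
/// kind. Returns `None` for MIME types that are not recognised feed formats.
fn feed_kind_from_mime(mime: &str) -> Option<FeedKind> {
    // Drop parameters such as "; charset=utf-8" and normalise case.
    let essence = mime
        .split(';')
        .next()
        .unwrap_or("")
        .trim()
        .to_ascii_lowercase();

    match essence.as_str() {
        "application/rss+xml" => Some(FeedKind::Rss),
        "application/atom+xml" => Some(FeedKind::Atom),
        "application/feed+json" => Some(FeedKind::JsonFeed),
        _ => None,
    }
}
```

Calling `feed_kind_from_mime("application/atom+xml; charset=utf-8")` yields `Some(FeedKind::Atom)`, while a non-feed type such as `text/html` yields `None`.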

Functions

batch_crawl
Crawl multiple seed URLs concurrently, each following links up to the configured depth.
batch_scrape
Scrape multiple URLs concurrently.
crawl
Crawl a website starting from url, following links up to the configured depth.
create_engine
Create a new crawl engine with the given configuration.
map_urls
Discover all pages on a website by following links and sitemaps.
scrape
Scrape a single URL, returning extracted page data.
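
The crawl entry describes following links from a starting URL up to a configured depth. The sketch below illustrates that general idea as a breadth-first traversal with a depth limit. It does not use kreuzcrawl's API, whose signatures are not shown on this page: `crawl_to_depth` is a hypothetical function, and an in-memory link map stands in for fetching pages and extracting their links.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Breadth-first traversal of a link graph starting at `seed`, visiting pages
/// at most `max_depth` links away from it. The `links` map is an assumption
/// standing in for "fetch the page and extract its outgoing links".
/// Returns each visited URL with the depth at which it was first reached.
fn crawl_to_depth(
    links: &HashMap<&str, Vec<&str>>,
    seed: &str,
    max_depth: usize,
) -> Vec<(String, usize)> {
    let mut seen: HashSet<String> = HashSet::new();
    let mut queue: VecDeque<(String, usize)> = VecDeque::new();
    let mut visited = Vec::new();

    seen.insert(seed.to_string());
    queue.push_back((seed.to_string(), 0));

    while let Some((url, depth)) = queue.pop_front() {
        visited.push((url.clone(), depth));

        // Pages at the depth limit are recorded, but their links are not followed.
        if depth == max_depth {
            continue;
        }

        if let Some(outgoing) = links.get(url.as_str()) {
            for &next in outgoing {
                // `insert` returns false for URLs already queued, so each
                // page is visited at most once even when links form a cycle.
                if seen.insert(next.to_string()) {
                    queue.push_back((next.to_string(), depth + 1));
                }
            }
        }
    }

    visited
}
```

With a depth of 0 only the seed page is visited, which matches the single-page case, while larger depths widen the traversal one link hop at a time. Breadth-first order is a choice made for this sketch; the page above does not state which traversal order the crate uses.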