pub struct Web2llmConfig {
pub user_agent: String,
pub timeout: Duration,
pub block_private_hosts: bool,
pub sensitivity: f32,
pub robots_check: bool,
pub rate_limit: u32,
pub max_concurrency: usize,
}
User-facing configuration for the web2llm pipeline.
Controls fetch behavior and request identity.
Use Web2llmConfig::default() for sensible defaults.
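As an illustrative sketch, the documented defaults can be combined with Rust's struct-update syntax to override only the fields you need. The `Default` values shown for `user_agent` and `timeout` are assumptions for illustration; the other defaults are the documented ones.

```rust
use std::time::Duration;

// Sketch of the config struct with its documented defaults; the
// user_agent and timeout defaults are assumptions for illustration.
#[derive(Debug, Clone)]
pub struct Web2llmConfig {
    pub user_agent: String,
    pub timeout: Duration,
    pub block_private_hosts: bool,
    pub sensitivity: f32,
    pub robots_check: bool,
    pub rate_limit: u32,
    pub max_concurrency: usize,
}

impl Default for Web2llmConfig {
    fn default() -> Self {
        Self {
            user_agent: "web2llm/0.1".to_string(), // assumed
            timeout: Duration::from_secs(30),      // assumed
            block_private_hosts: true,             // documented default
            sensitivity: 0.1,                      // documented default
            robots_check: true,                    // documented default
            rate_limit: 5,                         // documented default
            max_concurrency: 10,                   // documented default
        }
    }
}

fn main() {
    // Start from defaults and override only what you need.
    let cfg = Web2llmConfig {
        timeout: Duration::from_secs(10),
        ..Web2llmConfig::default()
    };
    assert!(cfg.block_private_hosts);
    assert_eq!(cfg.rate_limit, 5);
    assert_eq!(cfg.timeout, Duration::from_secs(10));
}
```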
Fields
user_agent: String
The user-agent string sent with every HTTP request. Also used for robots.txt compliance checks.
timeout: Duration
Maximum time to wait for a response before giving up.
block_private_hosts: bool
If true, requests to private, loopback, and link-local addresses are rejected during pre-flight validation. This prevents SSRF attacks when web2llm is used in a service that accepts user-supplied URLs.
Set to false if you need to fetch from localhost or internal hosts in a trusted environment, such as local development or testing.
Defaults to true.
sensitivity: f32
Controls how aggressively secondary content is filtered. The value sets a score threshold relative to the best-scoring branch: a value of 0.1 keeps every branch scoring within 10x of the best, while a value of 0.5 keeps only branches scoring at least half as well as the best.
Defaults to 0.1.
robots_check: bool
If true, the pipeline will fetch and respect robots.txt before downloading the target page.
Defaults to true.
rate_limit: u32
The maximum number of requests allowed per second.
Defaults to 5.
max_concurrency: usize
The maximum number of concurrent requests allowed across the whole pipeline.
Defaults to 10.
Implementations
impl Web2llmConfig
pub fn new(
user_agent: String,
timeout: Duration,
block_private_hosts: bool,
sensitivity: f32,
rate_limit: u32,
max_concurrency: usize,
) -> Self
Creates a new Web2llmConfig with the specified values. Note that robots_check is not a parameter here; it starts at its documented default of true and can be changed with with_robots_check.
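As a sketch of the argument order, the following stub mirrors the documented signature (the stub body and the example user-agent string are illustrative, not the crate's real implementation):

```rust
use std::time::Duration;

// Stub mirroring the documented fields and `new` signature.
#[allow(dead_code)]
struct Web2llmConfig {
    user_agent: String,
    timeout: Duration,
    block_private_hosts: bool,
    sensitivity: f32,
    robots_check: bool,
    rate_limit: u32,
    max_concurrency: usize,
}

impl Web2llmConfig {
    fn new(
        user_agent: String,
        timeout: Duration,
        block_private_hosts: bool,
        sensitivity: f32,
        rate_limit: u32,
        max_concurrency: usize,
    ) -> Self {
        Self {
            user_agent,
            timeout,
            block_private_hosts,
            sensitivity,
            robots_check: true, // documented default; set via with_robots_check
            rate_limit,
            max_concurrency,
        }
    }
}

fn main() {
    let cfg = Web2llmConfig::new(
        "my-bot/1.0 (+https://example.com/bot)".to_string(), // example UA
        Duration::from_secs(15),
        true, // block_private_hosts
        0.1,  // sensitivity
        5,    // rate_limit
        10,   // max_concurrency
    );
    assert!(cfg.robots_check);
    assert_eq!(cfg.max_concurrency, 10);
}
```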
pub fn with_robots_check(self, check: bool) -> Self
Builder-style method to set whether to check robots.txt.
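A minimal stub, mirroring only the documented builder signature (not the crate's real implementation), shows how the method is chained:

```rust
// Stub with just the field the builder touches.
#[derive(Debug)]
struct Web2llmConfig {
    robots_check: bool,
}

impl Web2llmConfig {
    // Consumes self and returns it with robots_check updated,
    // matching the documented builder-style signature.
    fn with_robots_check(mut self, check: bool) -> Self {
        self.robots_check = check;
        self
    }
}

fn main() {
    // robots_check starts at its documented default of true;
    // opt out only in a trusted environment.
    let cfg = Web2llmConfig { robots_check: true }.with_robots_check(false);
    assert!(!cfg.robots_check);
}
```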