
Struct SamplerConfig

pub struct SamplerConfig {
    pub seed: u64,
    pub batch_size: usize,
    pub ingestion_max_records: usize,
    pub chunking: ChunkingStrategy,
    pub recipes: Vec<TripletRecipe>,
    pub text_recipes: Vec<TextRecipe>,
    pub split: SplitRatios,
    pub allowed_splits: Vec<SplitLabel>,
}

Top-level sampler configuration.

Fields§

§seed: u64

RNG seed that controls deterministic sampling order.
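As a rough illustration of what "deterministic sampling order" means, the sketch below derives a shuffled index order purely from a seed. The `SplitMix64` generator and the `sampling_order` function are hypothetical stand-ins written for this example. They are not the crate's internal RNG or API, and the real sampler may order records differently.

```rust
/// Minimal SplitMix64 generator, used here only as a stand-in seeded RNG.
/// (Assumption: the crate's actual RNG is not documented on this page.)
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Returns the indices `0..len` in an order fully determined by `seed`
/// (Fisher-Yates shuffle driven by the seeded generator).
fn sampling_order(seed: u64, len: usize) -> Vec<usize> {
    let mut rng = SplitMix64::new(seed);
    let mut order: Vec<usize> = (0..len).collect();
    for i in (1..len).rev() {
        // Pick j in 0..=i; the slight modulo bias is acceptable for a sketch.
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        order.swap(i, j);
    }
    order
}
```

Calling `sampling_order` twice with the same seed and length yields the same order, which is the property a fixed `seed` is meant to provide.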

§batch_size: usize

Target number of samples per batch.

§ingestion_max_records: usize

Maximum number of records kept in the ingestion cache for candidate sampling.

This is intentionally decoupled from batch_size so anchors/negatives can be drawn from a broader rolling pool.

Practical tuning: values above batch_size usually improve diversity and reduce short-horizon repetition; gains taper off as source/recipe/split constraints become the limiting factor. Higher values also increase memory.

For remote shard-backed sources (for example Hugging Face), larger initial targets may require fetching more shards before the first batch, so startup latency can increase based on shard sizes and network throughput.
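The decoupling of the cache size from the batch size can be pictured with a small bounded pool. This is a minimal sketch under stated assumptions: `IngestionPool`, `ingest`, `candidates`, and `draw_batch` are hypothetical names invented for illustration, and the eviction policy (drop the oldest record) is an assumption, not something this page confirms about the crate.

```rust
use std::collections::VecDeque;

/// Hypothetical rolling cache: keeps at most `max_records` of the most
/// recently ingested record ids (plays the role of `ingestion_max_records`).
struct IngestionPool {
    max_records: usize,
    records: VecDeque<u64>,
}

impl IngestionPool {
    fn new(max_records: usize) -> Self {
        Self {
            max_records,
            records: VecDeque::with_capacity(max_records),
        }
    }

    /// Adds a record, evicting the oldest one once the cap is reached.
    fn ingest(&mut self, record_id: u64) {
        if self.max_records == 0 {
            return;
        }
        if self.records.len() == self.max_records {
            self.records.pop_front();
        }
        self.records.push_back(record_id);
    }

    /// Number of candidates a batch can currently draw from.
    fn candidates(&self) -> usize {
        self.records.len()
    }

    /// Draws up to `batch_size` record ids, starting at `start` and wrapping
    /// around the pool. A real sampler would pick candidates with its seeded
    /// RNG; the fixed offset just keeps this sketch deterministic.
    fn draw_batch(&self, batch_size: usize, start: usize) -> Vec<u64> {
        let n = self.records.len();
        (0..batch_size.min(n))
            .map(|k| self.records[(start + k) % n])
            .collect()
    }
}
```

With a pool larger than the batch, successive batches can start from different offsets and see different candidates, while memory use grows with `max_records` rather than with `batch_size`.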

§chunking: ChunkingStrategy

Chunking behavior for long sections.

§recipes: Vec<TripletRecipe>

Triplet recipes to use; empty means sources may provide defaults.

§text_recipes: Vec<TextRecipe>

Text recipes to use; empty means derived from triplet recipes if available.

§split: SplitRatios

Split ratios used when assigning records to train/val/test.
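One common way to assign records to splits by ratio is to hash a stable record identifier and map the hash onto cumulative ratio bands. The sketch below shows that general technique only. The `Split` enum, the `Ratios` struct and its field names, and the use of FNV-1a are all assumptions made for this example; the page does not document how `SplitRatios` is defined or how the crate performs the assignment.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Split {
    Train,
    Validation,
    Test,
}

/// Hypothetical ratios; expected to sum to 1.0.
struct Ratios {
    train: f64,
    validation: f64,
    test: f64,
}

impl Ratios {
    /// Checks that the three ratios sum to 1.0 within a small tolerance.
    fn is_normalized(&self) -> bool {
        (self.train + self.validation + self.test - 1.0).abs() < 1e-9
    }
}

/// FNV-1a hash of the record id, mixed with the seed. A stable hash keeps
/// the assignment reproducible across runs and machines.
fn stable_hash(seed: u64, record_id: &str) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325 ^ seed;
    for byte in record_id.bytes() {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
    }
    hash
}

/// Maps a record to a split by placing its hash on the interval [0, 1)
/// and comparing against cumulative ratio boundaries.
fn assign_split(seed: u64, record_id: &str, ratios: &Ratios) -> Split {
    let point = (stable_hash(seed, record_id) % 10_000) as f64 / 10_000.0;
    if point < ratios.train {
        Split::Train
    } else if point < ratios.train + ratios.validation {
        Split::Validation
    } else {
        Split::Test
    }
}
```

Because the assignment depends only on the seed, the record id, and the ratios, a given record lands in the same split every time, which keeps evaluation data from leaking into training between runs.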

§allowed_splits: Vec<SplitLabel>

Splits allowed for sampling requests.

Implementations§

impl SamplerConfig

pub fn with_denoiser(self, config: DenoiserConfig) -> Self

Consuming builder method that enables the built-in OCR/markdown denoiser on the sampler’s chunking strategy.

Lets you chain denoiser setup onto SamplerConfig construction. It combines with struct update syntax, so other fields can be customized in the same expression:

use triplets_core::{SamplerConfig, config::DenoiserConfig};

// Enable denoiser with all other fields at their defaults:
let config = SamplerConfig::default()
    .with_denoiser(DenoiserConfig { enabled: true, ..DenoiserConfig::default() });

// Or customize other fields first, then add the denoiser:
let config = SamplerConfig { batch_size: 32, ..SamplerConfig::default() }
    .with_denoiser(DenoiserConfig { enabled: true, ..DenoiserConfig::default() });
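For readers unfamiliar with the pattern, the following self-contained sketch shows what a consuming builder method looks like and why it chains after a struct literal. The `Config`, `Chunking`, and `Denoiser` types and their fields are simplified stand-ins invented for this example. In particular, the `chunking.denoiser` field path is an assumption; this page only states that the denoiser is enabled on the chunking strategy.

```rust
#[derive(Debug, Clone, Default, PartialEq)]
struct Denoiser {
    enabled: bool,
}

#[derive(Debug, Clone, Default, PartialEq)]
struct Chunking {
    // Assumption: the denoiser settings live inside the chunking strategy.
    denoiser: Denoiser,
}

#[derive(Debug, Clone, Default, PartialEq)]
struct Config {
    batch_size: usize,
    chunking: Chunking,
}

impl Config {
    /// Takes `self` by value, updates one nested field, and returns the
    /// config so the call can be chained onto any expression that
    /// produces a `Config`.
    fn with_denoiser(mut self, denoiser: Denoiser) -> Self {
        self.chunking.denoiser = denoiser;
        self
    }
}
```

Because `with_denoiser` consumes and returns the value, it can follow a struct literal directly, for example `Config { batch_size: 32, ..Config::default() }.with_denoiser(Denoiser { enabled: true })`, without needing a mutable binding.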

Trait Implementations§

impl Clone for SamplerConfig

fn clone(&self) -> SamplerConfig

Returns a duplicate of the value.

1.0.0 (const: unstable)

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.

impl Debug for SamplerConfig

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

impl Default for SamplerConfig

fn default() -> Self

Returns the “default value” for a type.

Auto Trait Implementations§

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> CloneToUninit for T
where T: Clone,

unsafe fn clone_to_uninit(&self, dest: *mut u8)

🔬This is a nightly-only experimental API. (clone_to_uninit)
Performs copy-assignment from self to dest.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T> Instrument for T

fn instrument(self, span: Span) -> Instrumented<Self>

Instruments this type with the provided Span, returning an Instrumented wrapper.

fn in_current_span(self) -> Instrumented<Self>

Instruments this type with the current Span, returning an Instrumented wrapper.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.

impl<T> Pointable for T

const ALIGN: usize

The alignment of pointer.

type Init = T

The type for initializers.

unsafe fn init(init: <T as Pointable>::Init) -> usize

Initializes a with the given initializer.

unsafe fn deref<'a>(ptr: usize) -> &'a T

Dereferences the given pointer.

unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut T

Mutably dereferences the given pointer.

unsafe fn drop(ptr: usize)

Drops the object pointed to by the given pointer.

impl<T> ToOwned for T
where T: Clone,

type Owned = T

The resulting type after obtaining ownership.

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning.

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

fn vzip(self) -> V

impl<T> WithSubscriber for T

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self>
where S: Into<Dispatch>,

Attaches the provided Subscriber to this type, returning a WithDispatch wrapper.

fn with_current_subscriber(self) -> WithDispatch<Self>

Attaches the current default Subscriber to this type, returning a WithDispatch wrapper.