Crate triplets_core

Core types, traits, and algorithms for the triplets data pipeline framework.

Re-exports

pub use chunking::ChunkingAlgorithm;
pub use chunking::SlidingWindowChunker;
pub use config::ChunkingStrategy;
pub use config::DenoiserConfig;
pub use config::NegativeStrategy;
pub use config::SamplerConfig;
pub use config::Selector;
pub use config::TextRecipe;
pub use config::TripletRecipe;
pub use data::DataRecord;
pub use data::PairLabel;
pub use data::QualityScore;
pub use data::RecordChunk;
pub use data::SampleBatch;
pub use data::SamplePair;
pub use data::SampleTriplet;
pub use data::SectionRole;
pub use data::TextBatch;
pub use data::TextSample;
pub use data::TripletBatch;
pub use hash::stable_hash_str;
pub use ingestion::IngestionManager;
pub use ingestion::RecordCache;
pub use kvp::KvpField;
pub use kvp::KvpPrefixSampler;
pub use preprocessor::TextPreprocessor;
pub use preprocessor::backends::denoiser_preprocessor::DenoiserPreprocessor;
pub use sampler::BatchPrefetcher;
pub use sampler::Sampler;
pub use sampler::TripletSampler;
pub use source::InMemorySource;
pub use source::backends::csv_source::CsvSource;
pub use source::backends::csv_source::CsvSourceConfig;
pub use source::DataSource;
pub use source::SourceCursor;
pub use splits::DeterministicSplitStore;
pub use splits::FileSplitStore;
pub use splits::SplitLabel;
pub use splits::SplitRatios;
pub use splits::SplitStore;
pub use types::CategoryId;
pub use types::HashPart;
pub use types::KvpValue;
pub use types::LogMessage;
pub use types::MetaValue;
pub use types::PathString;
pub use types::RecipeKey;
pub use types::RecordId;
pub use types::Sentence;
pub use types::SourceId;
pub use types::TaxonomyValue;

Modules

chunking
Pluggable chunking algorithms and the default sliding-window implementation.
config
Sampling configuration types.
constants
Centralized constants used across sampler, splits, and sources.
data
Data record and sample batch types.
hash
Stable deterministic hashing utilities.
heuristics
Capacity and sampling estimation helpers.
ingestion
Background ingestion and caching infrastructure.
kvp
Key/value prefix sampling helpers.
metadata
Metadata keys and helpers.
metrics
Aggregate metrics helpers.
preprocessor
Pluggable text preprocessor infrastructure, including OCR denoising and markdown-table cleanup for text chunks.
sampler
Sampler implementations and public sampling API.
source
Data source traits, built-in sources, and paging helpers.
splits
Split stores and persistence helpers.
tokenizer
Structural text tokenizer trait and whitespace implementation. Tokenization primitives used across chunking, sampling, and BM25 indexing.
types
Shared type aliases.
utils
Text normalization helpers shared by source implementations.
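
The chunking module is described above as providing a default sliding-window implementation. This page does not show its API, so the following is only a self-contained sketch of the general technique, assuming whitespace tokenization and a token-count window. The function name `sliding_window_chunks` and its `window` and `stride` parameters are hypothetical and are not taken from triplets_core.

```rust
// Hypothetical sketch of sliding-window chunking.
// None of these names come from the triplets_core API.

/// Splits `text` into overlapping chunks of at most `window` whitespace-separated
/// tokens, advancing the start of each chunk by `stride` tokens.
fn sliding_window_chunks(text: &str, window: usize, stride: usize) -> Vec<String> {
    assert!(window > 0 && stride > 0, "window and stride must be non-zero");

    let tokens: Vec<&str> = text.split_whitespace().collect();
    let mut chunks = Vec::new();
    let mut start = 0;

    while start < tokens.len() {
        let end = (start + window).min(tokens.len());
        chunks.push(tokens[start..end].join(" "));
        if end == tokens.len() {
            // The last window reached the end of the text; stop here so that
            // no trailing chunk is fully contained in the previous one.
            break;
        }
        start += stride;
    }
    chunks
}

fn main() {
    for chunk in sliding_window_chunks("a b c d e", 3, 2) {
        println!("{chunk}");
    }
}
```

With a stride smaller than the window, consecutive chunks share tokens, which keeps context that straddles a chunk boundary available in both neighbours. A stride larger than the window would skip tokens.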

Enums

SamplerError
Error type for sampler configuration, IO, and persistence failures.
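
The hash and splits modules above mention stable deterministic hashing and split stores. How the crate combines them is not documented on this page. As a hedged illustration of the general idea only, the sketch below assigns a record to a split by hashing its identifier with FNV-1a (chosen here as an assumption, not necessarily what `stable_hash_str` uses) and comparing the result against ratio thresholds. `Split`, `fnv1a_64`, and `assign_split` are hypothetical names.

```rust
// Hypothetical sketch of deterministic, hash-based split assignment.
// None of these names come from the triplets_core API.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Split {
    Train,
    Validation,
    Test,
}

/// 64-bit FNV-1a. Its output depends only on the input bytes, unlike the
/// randomly seeded default hasher used by `std::collections::HashMap`.
fn fnv1a_64(input: &str) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for byte in input.as_bytes() {
        hash ^= u64::from(*byte);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

/// Maps `record_id` to a split. `train` and `validation` are fractions in
/// [0, 1]; whatever remains goes to the test split.
fn assign_split(record_id: &str, train: f64, validation: f64) -> Split {
    // Keep the top 53 bits so the value converts to f64 exactly, then scale to [0, 1).
    let bucket = (fnv1a_64(record_id) >> 11) as f64 / (1u64 << 53) as f64;
    if bucket < train {
        Split::Train
    } else if bucket < train + validation {
        Split::Validation
    } else {
        Split::Test
    }
}

fn main() {
    for id in ["doc-1", "doc-2", "doc-3"] {
        println!("{id} -> {:?}", assign_split(id, 0.8, 0.1));
    }
}
```

Because the assignment is a pure function of the identifier and the ratios, the same record lands in the same split on every run without any stored state.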