# triplets

*Work in progress.*
Generate an effectively unlimited stream of training triplets, pairs, or plaintext samples from your existing corpus. This crate handles ingestion, multi-source mixing, deterministic train/validation/test splitting, and optional BM25 hard-negative mining.
## Overview
In metric learning and language model training, a triplet consists of an anchor, a positive example (similar to the anchor), and a negative example (dissimilar to the anchor).
triplets provides a high-throughput streaming pipeline to:
- Ingest data from local text/CSV files, Hugging Face, or custom backends.
- Mix sources with configurable weights to balance your training data.
- Split data deterministically into train, validation, and test sets.
- Sample triplets or pairs using rule-based "recipes".
- Mine hard negatives using BM25 to improve model discrimination.
```text
          Anchor
         /      \
  Positive      Negative
```

Triplet: (Anchor, Positive, Negative)
## Getting Started
A `TripletSampler` needs a `SplitStore` for record-to-split assignments and a `SamplerConfig` for runtime behavior.
```rust
use std::sync::Arc;
// Sketch: exact import paths, the store path, and constructor signatures
// are reconstructions of the stripped example; consult the crate docs.
use triplets::{FileSplitStore, SamplerConfig, SplitRatios, TripletSampler};

let ratios = SplitRatios::default();
let store = Arc::new(FileSplitStore::new("splits", ratios).unwrap());
let mut sampler = TripletSampler::new(store, SamplerConfig::default());
```
## Features
| Feature | What it enables | Default |
|---|---|---|
| `huggingface` | Streaming from Hugging Face dataset repositories. | No |
| `bm25-mining` | BM25 hard-negative ranking within strategy-defined pools. | No |
| `extended-metrics` | Additional per-triplet diagnostics for debugging. | No |
## Configuring Sources
### Hugging Face Source
Streams rows directly from the Hugging Face Hub without requiring a full dataset download. Map dataset columns to anchor, positive, or plain-text roles the same way as the CSV source.
### CSV Source
Load rows from a CSV file with explicit column mappings. The file must have a named header row — columns are always selected by name. Supports two modes:
- Role mode — map separate columns to anchor and positive (context) roles.
- Text mode — map a single column for SimCSE-style contrastive pre-training.
```rust
use std::sync::Arc;
// Sketch: type names, file paths, and signatures below are reconstructions
// of the stripped example; consult the crate docs for the exact API.
use triplets::{CsvSource, CsvSourceConfig, FileSplitStore, SamplerConfig, SplitRatios, TripletSampler};

let ratios = SplitRatios::default();
let store = Arc::new(FileSplitStore::new("splits", ratios).unwrap());
let mut sampler = TripletSampler::new(store, SamplerConfig::default());

// Role mode: map "question" → anchor, "answer" → positive.
let config = CsvSourceConfig::new("qa.csv")
    .with_anchor_column("question")
    .with_positive_column("answer")
    .with_trust(0.9);
let source = CsvSource::new(config).unwrap();
sampler.register_source(Arc::new(source));

// Text mode (SimCSE): single column used for both anchor and context.
let config2 = CsvSourceConfig::new("sentences.csv")
    .with_text_column("text");
let source2 = CsvSource::new(config2).unwrap();
sampler.register_source(Arc::new(source2));
```
Rows with empty required fields are skipped. Column name matching is case-insensitive.
### Text File Source
Recursively indexes plain-text files from a directory. Each file's stem (filename without extension) becomes the anchor and its body content becomes the context. Useful for local corpora where files are already titled meaningfully.
```rust
use std::sync::Arc;
// Sketch: type names, the directory path, and the trust value are
// reconstructions of the stripped example.
use triplets::{FileSplitStore, SamplerConfig, SplitRatios, TextFileSource, TextFileSourceConfig, TripletSampler};

let ratios = SplitRatios::default();
let store = Arc::new(FileSplitStore::new("splits", ratios).unwrap());
let mut sampler = TripletSampler::new(store, SamplerConfig::default());

// Point at a directory; all text files are indexed recursively.
// The filename stem is the anchor; the file body is the context.
let config = TextFileSourceConfig::new("./corpus")
    .with_text_files_only(true)
    .with_trust(0.8); // Assign a quality score to this source
let source = TextFileSource::new(config);
sampler.register_source(Arc::new(source));
```
### Custom Sources

Implement the `IndexableSource` trait to integrate any backend that can fetch records by a stable integer index.
```rust
use std::sync::Arc;
// Sketch: `MyDatabaseSource` is a hypothetical stand-in for your backend;
// the adapter name is a reconstruction of the stripped example.
use triplets::{FileSplitStore, IndexableSource, IndexableSourceAdapter, SamplerConfig, SplitRatios, TripletSampler};

struct MyDatabaseSource; // implements IndexableSource: fetch records by stable integer index

let ratios = SplitRatios::default();
let store = Arc::new(FileSplitStore::new("splits", ratios).unwrap());
let mut sampler = TripletSampler::new(store, SamplerConfig::default());
let adapter = IndexableSourceAdapter::new(MyDatabaseSource);
sampler.register_source(Arc::new(adapter));
```
## Sampling and Mixing
### Weighted Sampling
Adjust per-source sampling frequency to handle class imbalance or dataset quality differences.
```rust
use std::collections::HashMap;
// Sketch: the builder method and the weight map shape are reconstructions
// of the stripped example.
use triplets::SamplerConfig;

let mut source_weights = HashMap::new();
source_weights.insert("curated_csv".to_string(), 3.0); // sampled ~3× as often
source_weights.insert("web_dump".to_string(), 1.0);
let config = SamplerConfig::default().with_source_weights(source_weights);
```
### Recipe Selection Weights
The `weight` field on `TripletRecipe` controls how often a recipe is selected relative to other active recipes. The sampler expands each recipe into a proportional number of selection slots, shuffles them, and cycles through — so a recipe with `weight = 3.0` is drawn approximately three times as often as one with `weight = 1.0`.
| `weight` value | Effect |
|---|---|
| Equal across all recipes (e.g. all `1.0`) | Uniform round-robin — each recipe is selected equally often (default behavior). |
| `2.0` vs `1.0` | The `2.0` recipe is tried ~2× as often per batch. |
| `0.0` or negative | Recipe is excluded entirely — useful for disabling a recipe without removing it from configuration. |
```rust
// Sketch: the stripped example constructed a SamplerConfig whose recipes
// carry the per-recipe `weight` values described above.
use triplets::SamplerConfig;

let config = SamplerConfig::default();
```
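The slot-expansion behavior described above can be sketched in plain Rust. This is a local illustration with a hypothetical `expand_slots` helper, not the crate's internals:

```rust
// Expand each recipe's weight into a proportional number of selection slots,
// scaled so the smallest positive weight gets one slot. Zero or negative
// weights exclude the recipe entirely.
fn expand_slots(recipes: &[(&str, f64)]) -> Vec<String> {
    let min_w = recipes
        .iter()
        .map(|&(_, w)| w)
        .filter(|&w| w > 0.0)
        .fold(f64::INFINITY, f64::min);
    let mut slots = Vec::new();
    for &(name, w) in recipes {
        if w <= 0.0 {
            continue; // excluded recipe: no slots at all
        }
        let n = (w / min_w).round() as usize;
        slots.extend(std::iter::repeat(name.to_string()).take(n));
    }
    // The real sampler shuffles `slots` and cycles through them.
    slots
}

fn main() {
    let slots = expand_slots(&[("dense", 3.0), ("sparse", 1.0), ("off", 0.0)]);
    // "dense" gets three slots for every "sparse" slot; "off" gets none.
    assert_eq!(slots.iter().filter(|s| *s == "dense").count(), 3);
    assert_eq!(slots.iter().filter(|s| *s == "sparse").count(), 1);
    assert!(!slots.contains(&"off".to_string()));
}
```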
> **Sampling frequency vs. output score:** `TripletRecipe::weight` controls how often the recipe is selected. It is also one factor in the output `SampleTriplet::weight`, but the two serve different roles — see Output Format below.
## Output Format
Each `SampleTriplet` contains the sampled text and a computed training score.
```rust
use std::sync::Arc;
// Sketch: field names on SampleTriplet are reconstructions of the
// stripped example.
use triplets::{FileSplitStore, SamplerConfig, SplitRatios, TripletSampler};

let ratios = SplitRatios::default();
let store = Arc::new(FileSplitStore::new("splits", ratios).unwrap());
let mut sampler = TripletSampler::new(store, SamplerConfig::default());
let batch = sampler.next_triplet_batch().unwrap();
for triplet in batch.triplets {
    println!("{} | {} | {} (weight {})",
        triplet.anchor, triplet.positive, triplet.negative, triplet.weight);
}
```
### What `triplet.weight` means and how it is calculated
`SampleTriplet::weight` is a per-triplet training score in the range `(0.0, recipe.weight]`. Use it to scale each triplet's contribution to the loss — triplets that are more structurally coherent or come from higher-trust sources receive a higher score.

The value is computed as `triplet.weight = recipe.weight × chunk_quality`, where `chunk_quality` is the average of three per-slot signals (one per chunk: anchor, positive, negative). Each signal is the product of two independent factors:
| Factor | What it measures | How it is set |
|---|---|---|
| Window position score | `1 / (window_index + 1)` — earlier chunks in a section score higher (1.0 at index 0, 0.5 at index 1, 0.25 at index 3, …). | Automatic. |
| Source trust | Configured quality signal for the originating source (clamped to `[0, 1]`). | Set via `.with_trust(0.9)` on the source config. |
The resulting raw signal is clamped to `[chunk_weight_floor, 1.0]` (default floor: 0.1) before averaging.

The anchor/positive pair additionally has a proximity multiplier applied: chunks that are closer together within the same section receive a higher multiplier (two adjacent windows score 1.0; the score decreases as window distance grows). This rewards pairs that share local context.

A practical reading: a triplet from a high-trust source where all three chunks come from the opening windows of their sections will have `chunk_quality ≈ 1.0`, so `triplet.weight ≈ recipe.weight`. A triplet with chunks deep in long documents from a lower-trust source will have a noticeably smaller score.
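The formula can be checked numerically with a standalone sketch. Function names here are illustrative, and the proximity multiplier on the anchor/positive pair is omitted for brevity:

```rust
// Window position score: earlier chunks in a section score higher.
fn window_position_score(window_index: usize) -> f64 {
    1.0 / (window_index as f64 + 1.0)
}

// Per-chunk signal: position score × source trust, clamped to [floor, 1.0].
fn chunk_signal(window_index: usize, trust: f64, floor: f64) -> f64 {
    let raw = window_position_score(window_index) * trust.clamp(0.0, 1.0);
    raw.clamp(floor, 1.0) // default floor: 0.1
}

fn main() {
    let recipe_weight = 2.0;
    let floor = 0.1;

    // High-trust source, all three chunks from the opening window of their
    // sections: chunk_quality is exactly 1.0, so weight == recipe.weight.
    let best: f64 = [0, 0, 0].iter().map(|&i| chunk_signal(i, 1.0, floor)).sum::<f64>() / 3.0;
    assert_eq!(recipe_weight * best, 2.0);

    // Lower-trust source, chunks deep in long documents: every signal hits
    // the floor, giving a much smaller score.
    let worse: f64 = [4, 7, 9].iter().map(|&i| chunk_signal(i, 0.5, floor)).sum::<f64>() / 3.0;
    assert!(recipe_weight * worse < recipe_weight * best);
}
```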
In a training loop, pass the weight straight into your criterion:
```rust
use std::sync::Arc;
// Sketch: `triplet_loss` is a hypothetical stand-in for your criterion;
// import paths are reconstructions of the stripped example.
use triplets::{FileSplitStore, SamplerConfig, SplitRatios, TripletSampler};

let ratios = SplitRatios::default();
let store = Arc::new(FileSplitStore::new("splits", ratios).unwrap());
let mut sampler = TripletSampler::new(store, SamplerConfig::default());
let batch = sampler.next_triplet_batch().unwrap();

// Example: accumulate weighted loss over a batch.
let _weighted_loss: f32 = batch.triplets.iter()
    .map(|t| t.weight * triplet_loss(t))
    .sum();
```
## Source Within a Source
Each TripletRecipe is an independent code path over the sections of a record. Two recipes registered against the same source can express completely different training hypotheses about the same underlying data — no second source registration needed.
The mechanism is straightforward:
- Populate each `DataRecord::sections` with as many `RecordSection` entries as your data has natural views.
- Assign each section a `SectionRole` (or let position carry the meaning with `Selector::Paragraph(n)`).
- Write one `TripletRecipe` per hypothesis; each recipe independently specifies which sections fill the anchor, positive, and negative slots.
- Sources declare their own recipes via `default_triplet_recipes()` so callers need no recipe configuration at all.
### Sparse sections — optional data in the same record pool
Not every record needs to have all sections. If a recipe targets `Selector::Paragraph(2)` (the third section) and a record only has two sections, the sampler simply skips that record for that recipe only — the record continues to serve all other recipes normally. This lets you mix densely-covered and sparsely-covered training hypotheses in a single source without any record filtering logic in your data pipeline.
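A minimal model of this skip rule, using a local stand-in for the crate's selector type rather than its real API:

```rust
// Local stand-in mirroring the positional-selector concept from the text.
#[derive(Clone, Copy)]
enum Selector {
    Paragraph(usize),
}

// A recipe slot resolves to a section, or to None when the section is absent.
// None means: skip this record for this recipe only.
fn select<'a>(sections: &'a [&'a str], sel: Selector) -> Option<&'a str> {
    match sel {
        Selector::Paragraph(n) => sections.get(n).copied(),
    }
}

fn main() {
    let record = ["metrics view A", "metrics view B"]; // only two sections
    // A dense recipe targeting sections 0 and 1 succeeds on this record.
    assert!(select(&record, Selector::Paragraph(0)).is_some());
    assert!(select(&record, Selector::Paragraph(1)).is_some());
    // A sparse recipe targeting the optional third section yields no
    // candidate, so the record is skipped for that recipe alone.
    assert!(select(&record, Selector::Paragraph(2)).is_none());
}
```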
### Example — financial data source with two recipe strategies
Imagine each record represents one publicly-traded company with up to three sections:
| Index | Role | Content | Always present? |
|---|---|---|---|
| 0 | Anchor | Linearized financial metrics — view A (a random tag subset) | Yes |
| 1 | Context | Linearized financial metrics — view B (a disjoint tag subset) | Yes |
| 2 | (positional) | Earnings-call transcript for the same period | No — only when a transcript was found |
Two recipes target different aspects of the same records:
```rust
// Sketch: TripletRecipe field names and Selector variants are
// reconstructions; the recipe-intent comments come from the original example.
use triplets::{SectionRole, Selector, TripletRecipe};

// Cross-view recipe: both metric views are always present, so every record
// participates. Teaches the model that two different linearized views of the
// same company are semantically closer than any view of a different company.
let metrics_cross_view = TripletRecipe {
    name: "metrics_cross_view".into(),
    anchor: Selector::Role(SectionRole::Anchor),
    positive: Selector::Role(SectionRole::Context),
    weight: 1.0,
    ..TripletRecipe::default()
};

// Transcript recipe: targets an optional third section (index 2).
// Records without a transcript are skipped for *this recipe only* —
// they still serve the metrics_cross_view recipe above without any
// record filtering logic in the data pipeline.
//
// Lower weight reflects partial coverage: fewer records satisfy this
// recipe, so letting it drive the same number of gradient steps as the
// dense recipe would over-represent the companies with transcripts.
let metrics_to_transcript = TripletRecipe {
    name: "metrics_to_transcript".into(),
    anchor: Selector::Role(SectionRole::Anchor),
    positive: Selector::Paragraph(2),
    weight: 0.5,
    ..TripletRecipe::default()
};
```
The source returns both recipes from `default_triplet_recipes()` so that no recipe configuration is needed at the call site:
```rust
// Sketch: `CompanyFinancialsSource` and the recipe-building helpers are
// hypothetical; the trait method name comes from the text above.
use triplets::{DataSource, TripletRecipe};

impl DataSource for CompanyFinancialsSource {
    fn default_triplet_recipes(&self) -> Vec<TripletRecipe> {
        // Helpers returning the two recipes shown above.
        vec![metrics_cross_view(), metrics_to_transcript()]
    }
    // ... remaining DataSource items elided ...
}
```
When the sampler processes a record that has only two sections, it attempts each recipe in weighted order: `metrics_cross_view` succeeds (both `Role(Anchor)` and `Role(Context)` sections are present), while `metrics_to_transcript` returns no candidate for that slot (section index 2 is absent). The sampler moves on without any special handling in the data pipeline.
The same single `register_source` call enables both training hypotheses:
```rust
use std::sync::Arc;
// Sketch: `CompanyFinancialsSource` is the hypothetical source from above;
// import paths are reconstructions.
use triplets::{FileSplitStore, SamplerConfig, SplitRatios, TripletSampler};

let ratios = SplitRatios::default();
let store = Arc::new(FileSplitStore::new("splits", ratios).unwrap());
let mut sampler = TripletSampler::new(store, SamplerConfig::default());

// One registration — the source provides both recipes.
sampler.register_source(Arc::new(CompanyFinancialsSource::default()));
let batch = sampler.next_triplet_batch().unwrap();
// batch.triplets is a mix of "metrics_cross_view" and "metrics_to_transcript"
// samples, proportional to their configured weights and record coverage.
```
## Epochs and Determinism
### Iterating Epochs
In a typical training loop, signal a new epoch so the sampler can reset cursors and reshuffle sources.
```rust
// Sketch: the epoch-signalling method name is a reconstruction;
// `sampler` is constructed as in Getting Started.
for _epoch in 0..3 {
    sampler.begin_epoch(); // reset cursors and reshuffle sources
    let batch = sampler.next_triplet_batch().unwrap();
    // ... train on batch ...
}
```
### Deterministic Resuming
To resume training, initialize a `FileSplitStore` at the same path. The sampler automatically restores cursors, RNG state, and epoch progress from that store.
```rust
use std::sync::Arc;
// Sketch: import paths and the store path are reconstructions.
use triplets::{FileSplitStore, SamplerConfig, SplitRatios, TripletSampler};

// Same path as the original run: cursors, RNG state, and epoch progress
// are restored automatically.
let store = Arc::new(FileSplitStore::new("splits", SplitRatios::default()).unwrap());
let sampler = TripletSampler::new(store, SamplerConfig::default());
```
Note: Sampler state is intentionally lightweight. It persists source identifiers, integer record cursors, and compact RNG state vectors, not full data records. This keeps frequent checkpointing practical in long-running training jobs.
## Technical Details
### Threading Model
Concurrency is handled at multiple levels for high throughput:
- Prefetching: `BatchPrefetcher` runs a dedicated background worker thread that fills a bounded queue.
- Parallel Ingestion: Source refresh executes concurrently across registered sources during ingestion cycles.
- Synchronous API: Sampling calls are synchronous at the API boundary for straightforward training-loop integration.
- Thread-Safe Shared Use: `TripletSampler` is safe to share across threads (for example via `Arc`); concurrent calls are internally synchronized with a mutex, so a single sampler instance is callable from multiple threads without data races.
### Chunking and Windows
Long documents are handled through a pluggable `ChunkingAlgorithm`. The default `SlidingWindowChunker` splits sections into fixed-size token windows with configurable overlap, preserving full coverage of long text.
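The windowing scheme can be sketched over pre-split tokens. This is a local illustration of fixed-size windows with overlap, not the crate's `SlidingWindowChunker`:

```rust
// Split a token sequence into fixed-size windows that overlap by `overlap`
// tokens, covering every token in the input.
fn sliding_windows(tokens: &[&str], window: usize, overlap: usize) -> Vec<Vec<String>> {
    assert!(overlap < window, "overlap must be smaller than the window");
    let stride = window - overlap;
    let mut out = Vec::new();
    let mut start = 0;
    while start < tokens.len() {
        let end = (start + window).min(tokens.len());
        out.push(tokens[start..end].iter().map(|t| t.to_string()).collect());
        if end == tokens.len() {
            break; // full coverage reached
        }
        start += stride;
    }
    out
}

fn main() {
    let tokens = ["a", "b", "c", "d", "e", "f", "g"];
    let chunks = sliding_windows(&tokens, 4, 2);
    // Windows: [a b c d], [c d e f], [e f g]; every token is covered.
    assert_eq!(chunks.len(), 3);
    assert_eq!(chunks[1], vec!["c", "d", "e", "f"]);
}
```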
### Negative Mining
Negative selection is delegated to a pluggable backend.
- `DefaultBackend`: Uniform random selection from the candidate pool.
- `Bm25Backend` (requires `bm25-mining`): Ranks candidates by lexical overlap with the anchor to provide harder training examples.
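The idea behind hard-negative ranking can be shown with a simplified stand-in: score candidates by raw token overlap with the anchor and pick the highest-scoring one. Real BM25 additionally weights terms by inverse document frequency and normalizes for document length; `hardest_negative` below is a local illustration, not the `Bm25Backend` API:

```rust
use std::collections::HashSet;

// Pick the candidate with the most tokens in common with the anchor.
// Lexically similar (but wrong) candidates make harder training negatives.
fn hardest_negative<'a>(anchor: &str, candidates: &[&'a str]) -> Option<&'a str> {
    let anchor_terms: HashSet<&str> = anchor.split_whitespace().collect();
    candidates
        .iter()
        .max_by_key(|c| {
            c.split_whitespace()
                .filter(|t| anchor_terms.contains(t))
                .count()
        })
        .copied()
}

fn main() {
    let anchor = "quarterly revenue growth of the company";
    let pool = [
        "a recipe for sourdough bread",
        "annual revenue decline of the company",
    ];
    // The lexically closer candidate is selected as the harder negative.
    assert_eq!(hardest_negative(anchor, &pool), Some(pool[1]));
}
```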
## Capabilities
| Capability | Description |
|---|---|
| Source Agnostic | Implement `DataSource` or `IndexableSource` for any DB or API. |
| Weighted Sampling | Tune source and recipe frequencies to handle class imbalance. |
| Epoch Shuffling | Deterministic pseudo-random shuffling that re-permutes per epoch. |
| Instruction Tuning | Attach task-specific prompts (e.g., "Summarize this...") to specific recipes. |
| Metadata Decorators | Inject structured prefixes into sampled text via `KvpPrefixSampler`. |
| Anti-Shortcut | Includes anchor/positive swapping to avoid asymmetric slot bias. |
## License
triplets is distributed under both the MIT license and the Apache License (Version 2.0).
See LICENSE-APACHE and LICENSE-MIT for details.