# triplets

*Work in progress.*
Generate an effectively unlimited stream of training triplets, pairs, or plaintext samples from your existing corpus. This crate handles ingestion, multi-source mixing, deterministic train/validation/test splitting, and optional BM25 hard-negative mining.
## Overview
In metric learning and language model training, a triplet consists of an anchor, a positive example (similar to the anchor), and a negative example (dissimilar to the anchor).
triplets provides a high-throughput streaming pipeline to:
- Ingest data from local text/CSV files, Hugging Face, or custom backends.
- Mix sources with configurable weights to balance your training data.
- Split data deterministically into train, validation, and test sets.
- Sample triplets or pairs using rule-based "recipes".
- Mine hard negatives using BM25 to improve model discrimination.
```text
          Anchor
         /      \
  Positive      Negative
```

Triplet: (Anchor, Positive, Negative)
## Getting Started
A `TripletSampler` needs a `SplitStore` for record-to-split assignments and a `SamplerConfig` for runtime behavior.
```rust
use std::sync::Arc;
// Sketch: exact import paths, the store path, and constructor signatures
// are reconstructions of the stripped example; consult the crate docs.
use triplets::{FileSplitStore, SamplerConfig, SplitRatios, TripletSampler};

let ratios = SplitRatios::default();
let store = Arc::new(FileSplitStore::new("splits", ratios).unwrap());
let mut sampler = TripletSampler::new(store, SamplerConfig::default());
```
## Features
| Feature | What it enables | Default |
|---|---|---|
| `huggingface` | Streaming from Hugging Face dataset repositories. | No |
| `bm25-mining` | BM25 hard-negative ranking within strategy-defined pools. | No |
| `extended-metrics` | Additional per-triplet diagnostics for debugging. | No |
## Configuring Sources
### Hugging Face Source
Streams rows directly from the Hugging Face Hub without requiring a full dataset download. Map dataset columns to anchor, positive, or plain-text roles the same way as the CSV source.
### CSV Source
Load rows from a CSV file with explicit column mappings. The file must have a named header row — columns are always selected by name. Supports two modes:
- Role mode — map separate columns to anchor and positive (context) roles.
- Text mode — map a single column for SimCSE-style contrastive pre-training.
```rust
use std::sync::Arc;
// Sketch: type names, file paths, and signatures below are reconstructions
// of the stripped example; consult the crate docs for the exact API.
use triplets::{CsvSource, CsvSourceConfig, FileSplitStore, SamplerConfig, SplitRatios, TripletSampler};

let ratios = SplitRatios::default();
let store = Arc::new(FileSplitStore::new("splits", ratios).unwrap());
let mut sampler = TripletSampler::new(store, SamplerConfig::default());

// Role mode: map "question" → anchor, "answer" → positive.
let config = CsvSourceConfig::new("qa.csv")
    .with_anchor_column("question")
    .with_positive_column("answer")
    .with_trust(0.9);
let source = CsvSource::new(config).unwrap();
sampler.register_source(Arc::new(source));

// Text mode (SimCSE): single column used for both anchor and context.
let config2 = CsvSourceConfig::new("sentences.csv")
    .with_text_column("text");
let source2 = CsvSource::new(config2).unwrap();
sampler.register_source(Arc::new(source2));
```
Rows with empty required fields are skipped. Column name matching is case-insensitive.
### Text File Source
Recursively indexes plain-text files from a directory. Each file's stem (filename without extension) becomes the anchor and its body content becomes the context. Useful for local corpora where files are already titled meaningfully.
```rust
use std::sync::Arc;
// Sketch: type names, the directory path, and the trust value are
// reconstructions of the stripped example.
use triplets::{FileSplitStore, SamplerConfig, SplitRatios, TextFileSource, TextFileSourceConfig, TripletSampler};

let ratios = SplitRatios::default();
let store = Arc::new(FileSplitStore::new("splits", ratios).unwrap());
let mut sampler = TripletSampler::new(store, SamplerConfig::default());

// Point at a directory; all text files are indexed recursively.
// The filename stem is the anchor; the file body is the context.
let config = TextFileSourceConfig::new("./corpus")
    .with_text_files_only(true)
    .with_trust(0.8); // Assign a quality score to this source
let source = TextFileSource::new(config);
sampler.register_source(Arc::new(source));
```
### Custom Sources

Implement the `IndexableSource` trait to integrate any backend that can fetch records by a stable integer index.
```rust
use std::sync::Arc;
// Sketch: `MyDatabaseSource` is a hypothetical stand-in for your backend;
// the adapter name is a reconstruction of the stripped example.
use triplets::{FileSplitStore, IndexableSource, IndexableSourceAdapter, SamplerConfig, SplitRatios, TripletSampler};

struct MyDatabaseSource; // implements IndexableSource: fetch records by stable integer index

let ratios = SplitRatios::default();
let store = Arc::new(FileSplitStore::new("splits", ratios).unwrap());
let mut sampler = TripletSampler::new(store, SamplerConfig::default());
let adapter = IndexableSourceAdapter::new(MyDatabaseSource);
sampler.register_source(Arc::new(adapter));
```
## Sampling and Mixing
### Weighted Sampling
Adjust per-source sampling frequency to handle class imbalance or dataset quality differences.
```rust
use std::collections::HashMap;
// Sketch: the builder method and the weight map shape are reconstructions
// of the stripped example.
use triplets::SamplerConfig;

let mut source_weights = HashMap::new();
source_weights.insert("curated_csv".to_string(), 3.0); // sampled ~3× as often
source_weights.insert("web_dump".to_string(), 1.0);
let config = SamplerConfig::default().with_source_weights(source_weights);
```
### Recipe Selection Weights
The `weight` field on `TripletRecipe` controls how often a recipe is selected relative to other active recipes. The sampler expands each recipe into a proportional number of selection slots, shuffles them, and cycles through — so a recipe with `weight = 3.0` is drawn approximately three times as often as one with `weight = 1.0`.
| `weight` value | Effect |
|---|---|
| Equal across all recipes (e.g. all `1.0`) | Uniform round-robin — each recipe is selected equally often (default behavior). |
| `2.0` vs `1.0` | The `2.0` recipe is tried ~2× as often per batch. |
| `0.0` or negative | Recipe is excluded entirely — useful for disabling a recipe without removing it from configuration. |
```rust
// Sketch: the stripped example constructed a SamplerConfig whose recipes
// carry the per-recipe `weight` values described above.
use triplets::SamplerConfig;

let config = SamplerConfig::default();
```
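The slot-expansion behavior described above can be sketched in plain Rust. This is a local illustration with a hypothetical `expand_slots` helper, not the crate's internals:

```rust
// Expand each recipe's weight into a proportional number of selection slots,
// scaled so the smallest positive weight gets one slot. Zero or negative
// weights exclude the recipe entirely.
fn expand_slots(recipes: &[(&str, f64)]) -> Vec<String> {
    let min_w = recipes
        .iter()
        .map(|&(_, w)| w)
        .filter(|&w| w > 0.0)
        .fold(f64::INFINITY, f64::min);
    let mut slots = Vec::new();
    for &(name, w) in recipes {
        if w <= 0.0 {
            continue; // excluded recipe: no slots at all
        }
        let n = (w / min_w).round() as usize;
        slots.extend(std::iter::repeat(name.to_string()).take(n));
    }
    // The real sampler shuffles `slots` and cycles through them.
    slots
}

fn main() {
    let slots = expand_slots(&[("dense", 3.0), ("sparse", 1.0), ("off", 0.0)]);
    // "dense" gets three slots for every "sparse" slot; "off" gets none.
    assert_eq!(slots.iter().filter(|s| *s == "dense").count(), 3);
    assert_eq!(slots.iter().filter(|s| *s == "sparse").count(), 1);
    assert!(!slots.contains(&"off".to_string()));
}
```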
> **Sampling frequency vs. output score:** `TripletRecipe::weight` controls how often the recipe is selected. It is also one factor in the output `SampleTriplet::weight`, but the two serve different roles — see Output Format below.
## Output Format
Each `SampleTriplet` contains the sampled text and a computed training score.
```rust
use std::sync::Arc;
// Sketch: field names on SampleTriplet are reconstructions of the
// stripped example.
use triplets::{FileSplitStore, SamplerConfig, SplitRatios, TripletSampler};

let ratios = SplitRatios::default();
let store = Arc::new(FileSplitStore::new("splits", ratios).unwrap());
let mut sampler = TripletSampler::new(store, SamplerConfig::default());
let batch = sampler.next_triplet_batch().unwrap();
for triplet in batch.triplets {
    println!("{} | {} | {} (weight {})",
        triplet.anchor, triplet.positive, triplet.negative, triplet.weight);
}
```
### What `triplet.weight` means and how it is calculated
`SampleTriplet::weight` is a per-triplet training score in the range `(0.0, recipe.weight]`. Use it to scale each triplet's contribution to the loss — triplets that are more structurally coherent or come from higher-trust sources receive a higher score.

The value is computed as `triplet.weight = recipe.weight × chunk_quality`, where `chunk_quality` is the average of three per-slot signals (one per chunk: anchor, positive, negative). Each signal is the product of two independent factors:
| Factor | What it measures | How it is set |
|---|---|---|
| Window position score | `1 / (window_index + 1)` — earlier chunks in a section score higher (1.0 at index 0, 0.5 at index 1, 0.25 at index 3, …). | Automatic. |
| Source trust | Configured quality signal for the originating source (clamped to `[0, 1]`). | Set via `.with_trust(0.9)` on the source config. |
The resulting raw signal is clamped to `[chunk_weight_floor, 1.0]` (default floor: 0.1) before averaging.

The anchor/positive pair additionally has a proximity multiplier applied: chunks that are closer together within the same section receive a higher multiplier (two adjacent windows score 1.0; the score decreases as window distance grows). This rewards pairs that share local context.

A practical reading: a triplet from a high-trust source where all three chunks come from the opening windows of their sections will have `chunk_quality ≈ 1.0`, so `triplet.weight ≈ recipe.weight`. A triplet with chunks deep in long documents from a lower-trust source will have a noticeably smaller score.
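The formula can be checked numerically with a standalone sketch. Function names here are illustrative, and the proximity multiplier on the anchor/positive pair is omitted for brevity:

```rust
// Window position score: earlier chunks in a section score higher.
fn window_position_score(window_index: usize) -> f64 {
    1.0 / (window_index as f64 + 1.0)
}

// Per-chunk signal: position score × source trust, clamped to [floor, 1.0].
fn chunk_signal(window_index: usize, trust: f64, floor: f64) -> f64 {
    let raw = window_position_score(window_index) * trust.clamp(0.0, 1.0);
    raw.clamp(floor, 1.0) // default floor: 0.1
}

fn main() {
    let recipe_weight = 2.0;
    let floor = 0.1;

    // High-trust source, all three chunks from the opening window of their
    // sections: chunk_quality is exactly 1.0, so weight == recipe.weight.
    let best: f64 = [0, 0, 0].iter().map(|&i| chunk_signal(i, 1.0, floor)).sum::<f64>() / 3.0;
    assert_eq!(recipe_weight * best, 2.0);

    // Lower-trust source, chunks deep in long documents: every signal hits
    // the floor, giving a much smaller score.
    let worse: f64 = [4, 7, 9].iter().map(|&i| chunk_signal(i, 0.5, floor)).sum::<f64>() / 3.0;
    assert!(recipe_weight * worse < recipe_weight * best);
}
```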
In a training loop, pass the weight straight into your criterion:
```rust
use std::sync::Arc;
// Sketch: `triplet_loss` is a hypothetical stand-in for your criterion;
// import paths are reconstructions of the stripped example.
use triplets::{FileSplitStore, SamplerConfig, SplitRatios, TripletSampler};

let ratios = SplitRatios::default();
let store = Arc::new(FileSplitStore::new("splits", ratios).unwrap());
let mut sampler = TripletSampler::new(store, SamplerConfig::default());
let batch = sampler.next_triplet_batch().unwrap();

// Example: accumulate weighted loss over a batch.
let _weighted_loss: f32 = batch.triplets.iter()
    .map(|t| t.weight * triplet_loss(t))
    .sum();
```
## Source Within a Source
Each TripletRecipe is an independent code path over the sections of a record. Two recipes registered against the same source can express completely different training hypotheses about the same underlying data — no second source registration needed.
The mechanism is straightforward:
- Populate each `DataRecord::sections` with as many `RecordSection` entries as your data has natural views.
- Assign each section a `SectionRole` (or let position carry the meaning with `Selector::Paragraph(n)`).
- Write one `TripletRecipe` per hypothesis; each recipe independently specifies which sections fill the anchor, positive, and negative slots.
- Sources declare their own recipes via `default_triplet_recipes()` so callers need no recipe configuration at all.
### Sparse sections — optional data in the same record pool
Not every record needs to have all sections. If a recipe targets `Selector::Paragraph(2)` (the third section) and a record only has two sections, the sampler simply skips that record for that recipe only — the record continues to serve all other recipes normally. This lets you mix densely-covered and sparsely-covered training hypotheses in a single source without any record filtering logic in your data pipeline.
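A minimal model of this skip rule, using a local stand-in for the crate's selector type rather than its real API:

```rust
// Local stand-in mirroring the positional-selector concept from the text.
#[derive(Clone, Copy)]
enum Selector {
    Paragraph(usize),
}

// A recipe slot resolves to a section, or to None when the section is absent.
// None means: skip this record for this recipe only.
fn select<'a>(sections: &'a [&'a str], sel: Selector) -> Option<&'a str> {
    match sel {
        Selector::Paragraph(n) => sections.get(n).copied(),
    }
}

fn main() {
    let record = ["metrics view A", "metrics view B"]; // only two sections
    // A dense recipe targeting sections 0 and 1 succeeds on this record.
    assert!(select(&record, Selector::Paragraph(0)).is_some());
    assert!(select(&record, Selector::Paragraph(1)).is_some());
    // A sparse recipe targeting the optional third section yields no
    // candidate, so the record is skipped for that recipe alone.
    assert!(select(&record, Selector::Paragraph(2)).is_none());
}
```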
### Example — financial data source with two recipe strategies
Imagine each record represents one publicly-traded company with up to three sections:
| Index | Role | Content | Always present? |
|---|---|---|---|
| 0 | Anchor | Linearized financial metrics — view A (a random tag subset) | Yes |
| 1 | Context | Linearized financial metrics — view B (a disjoint tag subset) | Yes |
| 2 | (positional) | Earnings-call transcript for the same period | No — only when a transcript was found |
Two recipes target different aspects of the same records:
```rust
// Sketch: TripletRecipe field names and Selector variants are
// reconstructions; the recipe-intent comments come from the original example.
use triplets::{SectionRole, Selector, TripletRecipe};

// Cross-view recipe: both metric views are always present, so every record
// participates. Teaches the model that two different linearized views of the
// same company are semantically closer than any view of a different company.
let metrics_cross_view = TripletRecipe {
    name: "metrics_cross_view".into(),
    anchor: Selector::Role(SectionRole::Anchor),
    positive: Selector::Role(SectionRole::Context),
    weight: 1.0,
    ..TripletRecipe::default()
};

// Transcript recipe: targets an optional third section (index 2).
// Records without a transcript are skipped for *this recipe only* —
// they still serve the metrics_cross_view recipe above without any
// record filtering logic in the data pipeline.
//
// Lower weight reflects partial coverage: fewer records satisfy this
// recipe, so letting it drive the same number of gradient steps as the
// dense recipe would over-represent the companies with transcripts.
let metrics_to_transcript = TripletRecipe {
    name: "metrics_to_transcript".into(),
    anchor: Selector::Role(SectionRole::Anchor),
    positive: Selector::Paragraph(2),
    weight: 0.5,
    ..TripletRecipe::default()
};
```
The source returns both recipes from `default_triplet_recipes()` so that no recipe configuration is needed at the call site:
```rust
// Sketch: `CompanyFinancialsSource` and the recipe-building helpers are
// hypothetical; the trait method name comes from the text above.
use triplets::{DataSource, TripletRecipe};

impl DataSource for CompanyFinancialsSource {
    fn default_triplet_recipes(&self) -> Vec<TripletRecipe> {
        // Helpers returning the two recipes shown above.
        vec![metrics_cross_view(), metrics_to_transcript()]
    }
    // ... remaining DataSource items elided ...
}
```
When the sampler processes a record that has only two sections, it attempts each recipe in weighted order: `metrics_cross_view` succeeds (both `Role(Anchor)` and `Role(Context)` sections are present), while `metrics_to_transcript` returns no candidate for that slot (section index 2 is absent). The sampler moves on without any special handling in the data pipeline.
The same single `register_source` call enables both training hypotheses:
```rust
use std::sync::Arc;
// Sketch: `CompanyFinancialsSource` is the hypothetical source from above;
// import paths are reconstructions.
use triplets::{FileSplitStore, SamplerConfig, SplitRatios, TripletSampler};

let ratios = SplitRatios::default();
let store = Arc::new(FileSplitStore::new("splits", ratios).unwrap());
let mut sampler = TripletSampler::new(store, SamplerConfig::default());

// One registration — the source provides both recipes.
sampler.register_source(Arc::new(CompanyFinancialsSource::default()));
let batch = sampler.next_triplet_batch().unwrap();
// batch.triplets is a mix of "metrics_cross_view" and "metrics_to_transcript"
// samples, proportional to their configured weights and record coverage.
```
## Epochs and Determinism
### Iterating Epochs
In a typical training loop, signal a new epoch so the sampler can reset cursors and reshuffle sources.
```rust
// Sketch: the epoch-signalling method name is a reconstruction;
// `sampler` is constructed as in Getting Started.
for _epoch in 0..3 {
    sampler.begin_epoch(); // reset cursors and reshuffle sources
    let batch = sampler.next_triplet_batch().unwrap();
    // ... train on batch ...
}
```
### Deterministic Resuming
To resume training, initialize a `FileSplitStore` at the same path. The sampler automatically restores cursors, RNG state, and epoch progress from that store.
```rust
use std::sync::Arc;
// Sketch: import paths and the store path are reconstructions.
use triplets::{FileSplitStore, SamplerConfig, SplitRatios, TripletSampler};

// Same path as the original run: cursors, RNG state, and epoch progress
// are restored automatically.
let store = Arc::new(FileSplitStore::new("splits", SplitRatios::default()).unwrap());
let sampler = TripletSampler::new(store, SamplerConfig::default());
```
Note: Sampler state is intentionally lightweight. It persists source identifiers, integer record cursors, and compact RNG state vectors, not full data records. This keeps frequent checkpointing practical in long-running training jobs.
## Technical Details
### Threading Model
Concurrency is handled at multiple levels for high throughput:
- Prefetching: `BatchPrefetcher` runs a dedicated background worker thread that fills a bounded queue.
- Parallel Ingestion: Source refresh executes concurrently across registered sources during ingestion cycles.
- Synchronous API: Sampling calls are synchronous at the API boundary for straightforward training-loop integration.
- Thread-Safe Shared Use: `TripletSampler` is safe to share across threads (for example via `Arc`); concurrent calls are internally synchronized with a mutex, so a single sampler instance is callable from multiple threads without data races.
### Chunking and Windows
Long documents are handled through a pluggable `ChunkingAlgorithm`. The default `SlidingWindowChunker` splits sections into fixed-size token windows with configurable overlap, preserving full coverage of long text.
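The windowing scheme can be sketched over pre-split tokens. This is a local illustration of fixed-size windows with overlap, not the crate's `SlidingWindowChunker`:

```rust
// Split a token sequence into fixed-size windows that overlap by `overlap`
// tokens, covering every token in the input.
fn sliding_windows(tokens: &[&str], window: usize, overlap: usize) -> Vec<Vec<String>> {
    assert!(overlap < window, "overlap must be smaller than the window");
    let stride = window - overlap;
    let mut out = Vec::new();
    let mut start = 0;
    while start < tokens.len() {
        let end = (start + window).min(tokens.len());
        out.push(tokens[start..end].iter().map(|t| t.to_string()).collect());
        if end == tokens.len() {
            break; // full coverage reached
        }
        start += stride;
    }
    out
}

fn main() {
    let tokens = ["a", "b", "c", "d", "e", "f", "g"];
    let chunks = sliding_windows(&tokens, 4, 2);
    // Windows: [a b c d], [c d e f], [e f g]; every token is covered.
    assert_eq!(chunks.len(), 3);
    assert_eq!(chunks[1], vec!["c", "d", "e", "f"]);
}
```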
### Negative Mining
Negative selection is delegated to a pluggable backend.
- `DefaultBackend`: Uniform random selection from the candidate pool.
- `Bm25Backend` (requires `bm25-mining`): Ranks candidates by lexical overlap with the anchor to provide harder training examples.
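The idea behind hard-negative ranking can be shown with a simplified stand-in: score candidates by raw token overlap with the anchor and pick the highest-scoring one. Real BM25 additionally weights terms by inverse document frequency and normalizes for document length; `hardest_negative` below is a local illustration, not the `Bm25Backend` API:

```rust
use std::collections::HashSet;

// Pick the candidate with the most tokens in common with the anchor.
// Lexically similar (but wrong) candidates make harder training negatives.
fn hardest_negative<'a>(anchor: &str, candidates: &[&'a str]) -> Option<&'a str> {
    let anchor_terms: HashSet<&str> = anchor.split_whitespace().collect();
    candidates
        .iter()
        .max_by_key(|c| {
            c.split_whitespace()
                .filter(|t| anchor_terms.contains(t))
                .count()
        })
        .copied()
}

fn main() {
    let anchor = "quarterly revenue growth of the company";
    let pool = [
        "a recipe for sourdough bread",
        "annual revenue decline of the company",
    ];
    // The lexically closer candidate is selected as the harder negative.
    assert_eq!(hardest_negative(anchor, &pool), Some(pool[1]));
}
```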
## Capabilities
| Capability | Description |
|---|---|
| Source Agnostic | Implement `DataSource` or `IndexableSource` for any DB or API. |
| Weighted Sampling | Tune source and recipe frequencies to handle class imbalance. |
| Epoch Shuffling | Deterministic pseudo-random shuffling that re-permutes per epoch. |
| Instruction Tuning | Attach task-specific prompts (e.g., "Summarize this...") to specific recipes. |
| Metadata Decorators | Inject structured prefixes into sampled text via `KvpPrefixSampler`. |
| Anti-Shortcut | Includes anchor/positive swapping to avoid asymmetric slot bias. |
## License
triplets is distributed under both the MIT license and the Apache License (Version 2.0).
See LICENSE-APACHE and LICENSE-MIT for details.