# triplets

**WORK IN PROGRESS.**
Generate an effectively unlimited stream of training triplets, pairs, or plaintext samples from your existing corpus. This crate handles ingestion, multi-source mixing, deterministic train/validation/test splitting, and optional BM25 hard-negative mining.
## Overview
In metric learning and language model training, a triplet consists of an anchor, a positive example (similar to the anchor), and a negative example (dissimilar to the anchor).
`triplets` provides a high-throughput streaming pipeline to:
- Ingest data from local files, Hugging Face, or custom backends.
- Mix sources with configurable weights to balance your training data.
- Split data deterministically into train, validation, and test sets.
- Sample triplets or pairs using rule-based "recipes".
- Mine hard negatives using BM25 to improve model discrimination.
```text
          Anchor
         /      \
   Positive    Negative

Triplet: (Anchor, Positive, Negative)
```
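In code, a triplet is just a record of three texts. A minimal sketch of the concept (the field names mirror the diagram; this is not the crate's actual `SampleTriplet` layout):

```rust
/// One training example: an anchor plus a similar and a dissimilar text.
#[derive(Debug, Clone, PartialEq)]
struct Triplet {
    anchor: String,
    positive: String,
    negative: String,
}

fn main() {
    let t = Triplet {
        anchor: "rust ownership rules".into(),
        positive: "the borrow checker and lifetimes".into(),
        negative: "a banana bread recipe".into(),
    };
    println!("{:?}", t);
}
```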
## Getting Started

A `TripletSampler` needs a `SplitStore` for record-to-split assignments and a `SamplerConfig` for runtime behavior.
A minimal setup (field and argument names here are illustrative; consult the crate docs for exact signatures):

```rust
use std::sync::Arc;
use triplets::{FileSplitStore, SamplerConfig, SplitRatios, TripletSampler};

let ratios = SplitRatios { train: 0.8, validation: 0.1, test: 0.1 };
let store = Arc::new(FileSplitStore::new("splits", ratios)?);
let mut sampler = TripletSampler::new(store, SamplerConfig::default());
```
## Features

| Feature | What it enables | Default |
|---|---|---|
| `huggingface` | Streaming from Hugging Face dataset repositories. | No |
| `bm25-mining` | BM25 hard-negative ranking within strategy-defined pools. | No |
| `extended-metrics` | Additional per-triplet diagnostics for debugging. | No |
## Configuring Sources

### Local File Source

Recursively indexes text files from a directory. Ideal for local datasets and exported corpora.
A sketch following the same setup; `SourceConfig` and `LocalFileSource` are illustrative names:

```rust
use std::sync::Arc;
use triplets::{FileSplitStore, SamplerConfig, SplitRatios, TripletSampler};

let ratios = SplitRatios { train: 0.8, validation: 0.1, test: 0.1 };
let store = Arc::new(FileSplitStore::new("splits", ratios)?);
let mut sampler = TripletSampler::new(store, SamplerConfig::default());

// Create a source named "docs" targeting a local directory.
let config = SourceConfig::new("docs", "./corpus")
    .with_text_files_only(true)
    .with_trust(0.9); // Assign a quality score to this source
let source = LocalFileSource::new(config);
sampler.register_source(source);
```
### Hugging Face Source
Streams rows directly from the Hugging Face Hub without requiring a full dataset download.
### Custom Data Source

Implement the `IndexableSource` trait to integrate any backend that can fetch records by a stable integer index.
A sketch; `IndexableSource` is the trait named above, while the backend struct and adapter names are illustrative:

```rust
use std::sync::Arc;
use chrono::Utc;
use triplets::{FileSplitStore, IndexableSource, SamplerConfig, SplitRatios, TripletSampler};

struct MyDbSource {
    // connection handle, table name, ...
}

impl IndexableSource for MyDbSource {
    // Fetch records by a stable integer index; stamp them with `Utc::now()`.
    // (Trait methods elided; see the crate docs for exact signatures.)
}

let ratios = SplitRatios { train: 0.8, validation: 0.1, test: 0.1 };
let store = Arc::new(FileSplitStore::new("splits", ratios)?);
let mut sampler = TripletSampler::new(store, SamplerConfig::default());
let adapter = IndexableSourceAdapter::new(MyDbSource { /* ... */ });
sampler.register_source(adapter);
```
## Sampling and Mixing

### Weighted Sampling
Adjust per-source sampling frequency to handle class imbalance or dataset quality differences.
A per-source weight map might look like this (how the map is passed to the sampler is elided here):

```rust
use std::collections::HashMap;

// "docs" records are drawn three times as often as "web" records.
let weights: HashMap<String, f64> =
    HashMap::from([("docs".to_string(), 3.0), ("web".to_string(), 1.0)]);
```
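To make the weighting concrete, here is a self-contained sketch (not the crate's implementation) of a cumulative-weight draw: normalize the weights, then map a uniform number in [0, 1) to a source name.

```rust
/// Pick a source name by weight, given a uniform draw `u` in [0, 1).
fn pick_source<'a>(weights: &'a [(&'a str, f64)], u: f64) -> &'a str {
    let total: f64 = weights.iter().map(|(_, w)| *w).sum();
    let mut cum = 0.0;
    for &(name, w) in weights {
        cum += w / total;
        if u < cum {
            return name;
        }
    }
    weights.last().expect("non-empty weights").0
}

fn main() {
    // "docs" carries weight 3.0 and "web" 1.0, so "docs" covers [0, 0.75).
    let weights = [("docs", 3.0), ("web", 1.0)];
    assert_eq!(pick_source(&weights, 0.10), "docs");
    assert_eq!(pick_source(&weights, 0.80), "web");
}
```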
## Output Format

The sampler produces `SampleTriplet` values containing sampled text and associated metadata.
A sketch (batch size and field names are illustrative):

```rust
use std::sync::Arc;
use triplets::{FileSplitStore, SamplerConfig, SplitRatios, TripletSampler};

let ratios = SplitRatios { train: 0.8, validation: 0.1, test: 0.1 };
let store = Arc::new(FileSplitStore::new("splits", ratios)?);
let mut sampler = TripletSampler::new(store, SamplerConfig::default());

let batch = sampler.next_triplet_batch(32).unwrap();
for triplet in batch.triplets {
    println!("{}", triplet.anchor); // likewise `positive`, `negative`, metadata
}
```
## Epochs and Determinism

### Iterating Epochs
In a typical training loop, signal a new epoch so the sampler can reset cursors and reshuffle sources.
A sketch (`begin_epoch` is an illustrative name for the epoch-reset call):

```rust
for epoch in 0..num_epochs {
    sampler.begin_epoch(epoch); // reset cursors and reshuffle sources
    // ... draw triplet batches for this epoch ...
}
```
### Deterministic Resuming

To resume training, initialize a `FileSplitStore` at the same path. The sampler automatically restores cursors, RNG state, and epoch progress from that store.
```rust
use std::sync::Arc;
use triplets::{FileSplitStore, SamplerConfig, SplitRatios, TripletSampler};

// Re-opening the store at the same path restores cursors, RNG state,
// and epoch progress; the ratios must match the original run.
let ratios = SplitRatios { train: 0.8, validation: 0.1, test: 0.1 };
let store = Arc::new(FileSplitStore::new("splits", ratios)?);
let mut sampler = TripletSampler::new(store, SamplerConfig::default());
```
Note: Sampler state is intentionally lightweight. It persists source identifiers, integer record cursors, and compact RNG state vectors, not full data records. This keeps frequent checkpointing practical in long-running training jobs.
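The determinism above rests on stable record-to-split assignment. One standard way to picture it (illustrative, not necessarily the crate's scheme) is to hash a stable record ID into [0, 1) and bucket by the configured ratios; FNV-1a is used here because its output is stable across runs and platforms.

```rust
/// FNV-1a: a tiny, deterministic, platform-independent string hash.
fn fnv1a(s: &str) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for b in s.as_bytes() {
        h ^= *b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Map a stable record ID into train/validation/test by ratio.
fn split_for(record_id: &str, train: f64, validation: f64) -> &'static str {
    let u = (fnv1a(record_id) % 10_000) as f64 / 10_000.0;
    if u < train {
        "train"
    } else if u < train + validation {
        "validation"
    } else {
        "test"
    }
}

fn main() {
    let a = split_for("doc-42", 0.8, 0.1);
    let b = split_for("doc-42", 0.8, 0.1);
    assert_eq!(a, b); // same ID, same split, every run
    println!("doc-42 -> {}", a);
}
```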
## Technical Details

### Threading Model
Concurrency is handled at multiple levels for high throughput:
- Prefetching: `BatchPrefetcher` runs a dedicated background worker thread that fills a bounded queue.
- Parallel Ingestion: Source refresh executes concurrently across registered sources during ingestion cycles.
- Synchronous API: Sampling calls are synchronous at the API boundary for straightforward training-loop integration.
- Thread-Safe Shared Use: `TripletSampler` is safe to share across threads (for example via `Arc`); concurrent calls are internally synchronized with a mutex, so a single sampler instance is callable from multiple threads without data races.
### Chunking and Windows

Long documents are handled through a pluggable `ChunkingAlgorithm`. The default `SlidingWindowChunker` splits sections into fixed-size token windows with configurable overlap, preserving full coverage of long text.
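A self-contained sketch of the sliding-window idea (the real `SlidingWindowChunker` operates on tokenized sections; this toy version takes a pre-split token slice):

```rust
/// Split tokens into fixed-size windows where consecutive windows share
/// `overlap` tokens, so no token of a long document is ever dropped.
fn sliding_windows(tokens: &[&str], window: usize, overlap: usize) -> Vec<Vec<String>> {
    assert!(window > overlap, "stride must be positive");
    let stride = window - overlap;
    let mut out = Vec::new();
    let mut start = 0;
    while start < tokens.len() {
        let end = (start + window).min(tokens.len());
        out.push(tokens[start..end].iter().map(|t| t.to_string()).collect());
        if end == tokens.len() {
            break;
        }
        start += stride;
    }
    out
}

fn main() {
    // Window of 3 tokens with 1 token of overlap: "c" appears in both windows.
    let wins = sliding_windows(&["a", "b", "c", "d", "e"], 3, 1);
    assert_eq!(wins.len(), 2);
    assert_eq!(wins[1], vec!["c", "d", "e"]);
}
```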
### Negative Mining
Negative selection is delegated to a pluggable backend.
- `DefaultBackend`: Uniform random selection from the candidate pool.
- `Bm25Backend` (requires `bm25-mining`): Ranks candidates by lexical overlap with the anchor to provide harder training examples.
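To see why BM25 yields harder negatives, here is a minimal, self-contained BM25 scorer (illustrative only; the `Bm25Backend` has its own implementation). Candidates that share rare terms with the anchor score higher and are therefore "harder" to distinguish.

```rust
use std::collections::HashMap;

/// Score one candidate document against a query over a tiny in-memory corpus.
fn bm25_score(query: &[&str], doc: &[&str], corpus: &[Vec<&str>], k1: f64, b: f64) -> f64 {
    let n = corpus.len() as f64;
    let avg_len: f64 = corpus.iter().map(|d| d.len() as f64).sum::<f64>() / n;
    // Term frequencies within the candidate document.
    let mut tf: HashMap<&str, f64> = HashMap::new();
    for t in doc {
        *tf.entry(*t).or_insert(0.0) += 1.0;
    }
    let mut score = 0.0;
    for term in query {
        let df = corpus.iter().filter(|d| d.contains(term)).count() as f64;
        let idf = ((n - df + 0.5) / (df + 0.5) + 1.0).ln();
        let f = *tf.get(term).unwrap_or(&0.0);
        let denom = f + k1 * (1.0 - b + b * doc.len() as f64 / avg_len);
        score += idf * f * (k1 + 1.0) / denom;
    }
    score
}

fn main() {
    let corpus = vec![
        vec!["rust", "borrow", "checker"],
        vec!["banana", "bread", "recipe"],
    ];
    let anchor = ["rust", "borrow"];
    let hard = bm25_score(&anchor, &corpus[0], &corpus, 1.2, 0.75);
    let easy = bm25_score(&anchor, &corpus[1], &corpus, 1.2, 0.75);
    // The lexically overlapping candidate outranks the unrelated one.
    assert!(hard > easy);
}
```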
## Capabilities
| Capability | Description |
|---|---|
| Source Agnostic | Implement `DataSource` or `IndexableSource` for any DB or API. |
| Weighted Sampling | Tune source and recipe frequencies to handle class imbalance. |
| Epoch Shuffling | Deterministic pseudo-random shuffling that re-permutes per epoch. |
| Instruction Tuning | Attach task-specific prompts (e.g., "Summarize this...") to specific recipes. |
| Metadata Decorators | Inject structured prefixes into sampled text via `KvpPrefixSampler`. |
| Anti-Shortcut | Includes anchor/positive swapping to avoid asymmetric slot bias. |
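The Epoch Shuffling capability can be sketched as a seeded Fisher–Yates permutation keyed by (seed, epoch): every run reproduces the same order, while each epoch still sees a fresh one. The generator below is a splitmix64-style mixer, illustrative of the idea rather than the crate's actual RNG.

```rust
/// Deterministic per-epoch shuffle: the same (seed, epoch) pair always
/// yields the same permutation of record indices.
fn epoch_permutation(len: usize, seed: u64, epoch: u64) -> Vec<usize> {
    let mut state = seed ^ epoch.wrapping_mul(0x9e3779b97f4a7c15);
    let mut next = move || {
        state = state.wrapping_add(0x9e3779b97f4a7c15);
        let mut z = state;
        z = (z ^ (z >> 30)).wrapping_mul(0xbf58476d1ce4e5b9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94d049bb133111eb);
        z ^ (z >> 31)
    };
    let mut idx: Vec<usize> = (0..len).collect();
    // Fisher-Yates shuffle driven by the seeded generator.
    for i in (1..len).rev() {
        let j = (next() % (i as u64 + 1)) as usize;
        idx.swap(i, j);
    }
    idx
}

fn main() {
    let a = epoch_permutation(10, 7, 0);
    let b = epoch_permutation(10, 7, 0);
    assert_eq!(a, b); // same (seed, epoch) => same order
    let mut c = epoch_permutation(10, 7, 1);
    c.sort();
    assert_eq!(c, (0..10).collect::<Vec<_>>()); // still a valid permutation
}
```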
## License
triplets is distributed under both the MIT license and the Apache License (Version 2.0).
See LICENSE-APACHE and LICENSE-MIT for details.