triplets
WORK IN PROGRESS
Composable Rust crate for deterministic multi-source sampling and split persistence for ML/AI training data.
triplets is a reusable core for ML/AI training-data orchestration. It provides sampler primitives, split/state persistence, chunking and weighting mechanics, and source abstractions (DataSource, DataRecord) without tying behavior to proprietary corpora.
At a glance
triplets is for building reproducible ML/AI training batches from multiple data sources.
Compared with a static prebuilt dataset, it lets you sample at runtime while preserving deterministic behavior.
Threading model: source refresh work is parallelized per sampling call, while batch assembly remains serialized and deterministic.
Core capabilities
- Source-agnostic sampling: implement `DataSource` for filesystem, APIs, DBs, streams, etc.
- Runtime example generation: produce triplet/pair/text batches from recipe selectors.
- Deterministic split assignment: stable train/validation/test assignment from record IDs + seed.
- Resume support: persist sampler/split state and continue after restart.
- Bounded ingestion: refresh in controlled windows instead of loading full corpora into memory.
- Per-source progression: each source has its own cursor; sources do not need to advance in lockstep.
- Per-call concurrency: source refreshes run in parallel within a sampling call, then merge before batch assembly.
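The deterministic split idea above can be sketched as hashing a record ID together with a seed into [0, 1) and bucketing by the configured ratios. This is an illustration with hypothetical names, not the crate's actual implementation; note that `DefaultHasher`'s output is not guaranteed stable across Rust releases, so a real implementation would use a stable hash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Illustrative only: map record ID + seed to a stable split label
/// by hashing into [0, 1) and bucketing by the configured ratios.
fn assign_split(record_id: &str, seed: u64, train: f64, validation: f64) -> &'static str {
    let mut hasher = DefaultHasher::new();
    record_id.hash(&mut hasher);
    seed.hash(&mut hasher);
    // Normalize the 64-bit hash into [0, 1).
    let x = (hasher.finish() as f64) / (u64::MAX as f64 + 1.0);
    if x < train {
        "train"
    } else if x < train + validation {
        "validation"
    } else {
        "test"
    }
}

fn main() {
    // Same ID + seed always lands in the same split.
    let a = assign_split("record-42", 7, 0.8, 0.1);
    let b = assign_split("record-42", 7, 0.8, 0.1);
    assert_eq!(a, b);
    println!("record-42 -> {a}");
}
```

Because the label depends only on the record ID, seed, and ratios, it stays stable across restarts and across machines, which is what makes resume-safe split assignment possible.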
Not included
- This crate does not do semantic mining/retrieval scoring by itself.
- This crate does not guarantee semantic hardness beyond your recipes and source metadata design.
- Sources can be finite or unbounded; infinite streaming is supported but not required.
Getting started
Add triplets to a downstream crate:
To run the included examples in this repository (for exploration/contributor workflow):
For contributors (development check):
Minimal shape:
- Implement one or more `DataSource` backends.
- Create `SamplerConfig` (chunking, recipes, split policy).
- Open a split store (`DeterministicSplitStore` or `FileSplitStore`).
- Construct `PairSampler` and register sources.
- Call one of the batch APIs: `next_triplet_batch(split)`, `next_pair_batch(split)`, or `next_text_batch(split)`.
- Call `persist_state()` when you want restart-resume behavior.
Examples
From the triplets crate:

```sh
# sample triplet batches
# inspect CLI flags
# metadata-only capacity estimation
```
Source roots can be overridden with repeatable flags:
Split-store path configuration
The `multi_source_demo` example persists sampler/split state by default to `.sampler_store/split_store.bin`.
You can override persistence location with either:
- `--split-store-path <FILE>` for an explicit file path
- `--split-store-dir <DIR>` to keep filename `split_store.bin` in a custom directory
Usage flow
Short version:
- Call `sampler.next_*_batch(split)` to sample batches (ingestion happens automatically).
- Call `sampler.persist_state()` when you want restart-resume behavior.
- Optionally call `sampler.set_epoch(n)` for explicit epoch control.
Step-by-step:
- Build config + open the split store.
- Register sources.
- Call one of `sampler.next_triplet_batch(split)`, `sampler.next_pair_batch(split)`, or `sampler.next_text_batch(split)`.
- Call `sampler.persist_state()` when you want to save progress (typically at the end of an epoch, or at explicit checkpoint boundaries).
- Optionally call `sampler.set_epoch(n)` for explicit epoch replay/order.
Operational notes:
- File-backed indexing is rebuilt per process/run and stored in an OS temp-backed index store.
- Persisting sampler/split state is explicit and manual.
- One split-store file shares sampler/source cursor + RNG state unless you use separate store files.
- Batch calls are thread-safe but serialized; refresh work within a call can be parallelized per source.
- Source cursors advance independently per source, so one source can continue making progress even if another source is sparse or slower.
- Refresh concurrency is per call: source refreshes run in parallel for that call, then the sampler joins all refresh threads before merging buffers (not an always-on per-source background ingest loop).
- Prefetchers smooth latency by filling bounded queues from the existing batch APIs (`next_triplet_batch`, `next_pair_batch`, `next_text_batch`).
- New data from streaming sources is pulled in on the next batch call.
- `sampler.persist_state()` is manual; skipping it means no resume state after restart.
- `sampler.set_epoch(n)` is an advanced override and is not required for normal resume behavior.
- `IngestionManager::source_refresh_stats()` exposes per-source refresh duration/records/throughput/errors.
- `metrics::source_skew` summarizes per-source sample imbalance for a batch.
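The per-call refresh model described above (parallel per-source refresh, join, then a deterministic merge) can be illustrated with a small stand-in. The function below is hypothetical; real refreshes would do I/O, but the shape is the same: spawn one thread per source, join them all, and only then merge buffers in a fixed source order so thread completion order cannot affect results.

```rust
use std::thread;

/// Illustrative sketch: each source "refreshes" on its own thread,
/// all threads are joined, and only then are buffers merged in a
/// fixed order by source ID.
fn refresh_all(sources: Vec<(u32, Vec<u32>)>) -> Vec<u32> {
    let handles: Vec<_> = sources
        .into_iter()
        .map(|(source_id, pending)| {
            thread::spawn(move || {
                // Stand-in for real I/O: "refresh" just returns the records.
                (source_id, pending)
            })
        })
        .collect();

    // Join all refresh threads before any merging happens.
    let mut buffers: Vec<(u32, Vec<u32>)> =
        handles.into_iter().map(|h| h.join().unwrap()).collect();

    // Deterministic merge: sort by source ID, regardless of which
    // thread finished first.
    buffers.sort_by_key(|(source_id, _)| *source_id);
    buffers.into_iter().flat_map(|(_, records)| records).collect()
}

fn main() {
    let merged = refresh_all(vec![(2, vec![20, 21]), (1, vec![10])]);
    assert_eq!(merged, vec![10, 20, 21]);
    println!("{merged:?}");
}
```

The join-before-merge step is what keeps batch assembly serialized and deterministic even though refresh work is parallel.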
Example:
```rust
use std::sync::Arc;
use triplets::{DeterministicSplitStore, PairSampler, SamplerConfig, SplitRatios};

// Constructor arguments are elided in this sketch.
# let split = SplitRatios::default();
# let store = DeterministicSplitStore::new();
# let config = SamplerConfig::default();
let sampler = Arc::new(PairSampler::new(config, store));
// register sources...
let prefetcher = sampler.clone().prefetch_triplet_batches(split);
let batch = prefetcher.next().unwrap();
let _ = batch;
```
- For per-call source weighting, use `next_triplet_batch_with_weights(...)`, `next_pair_batch_with_weights(...)`, or `next_text_batch_with_weights(...)`.
- Missing source ids default to `1.0`; `0.0` disables a source for that call.
- Production readiness note: if `len_hint` drifts in streaming/append-only sources, epoch order/coverage can repeat/skip records within an epoch, even though split assignment remains deterministic.
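The weight-resolution rule above (missing IDs default to `1.0`, `0.0` disables a source for the call) can be sketched as follows. The function name and shapes here are hypothetical, not the crate's API:

```rust
use std::collections::HashMap;

/// Illustrative: resolve per-call weights. A missing source ID
/// defaults to 1.0, and a weight of 0.0 removes the source for
/// this call.
fn effective_weights(
    source_ids: &[&str],
    overrides: &HashMap<&str, f64>,
) -> Vec<(String, f64)> {
    source_ids
        .iter()
        .map(|id| (id.to_string(), *overrides.get(id).unwrap_or(&1.0)))
        .filter(|(_, w)| *w > 0.0)
        .collect()
}

fn main() {
    let overrides = HashMap::from([("code", 2.0), ("forum", 0.0)]);
    let resolved = effective_weights(&["code", "wiki", "forum"], &overrides);
    // "wiki" falls back to 1.0; "forum" is disabled for this call.
    assert_eq!(
        resolved,
        vec![("code".to_string(), 2.0), ("wiki".to_string(), 1.0)]
    );
    println!("{resolved:?}");
}
```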
Sampling behavior (current)
This reflects the built-in file-corpus helpers (FileCorpusIndex) used by filesystem-backed sources.
- Ingestion: `next_triplet_batch(split)`, `next_pair_batch(split)`, and `next_text_batch(split)` trigger refresh; per-source buffers refill when empty (or on force refresh).
- Memory bound: refresh/cache limits are bounded by `ingestion_max_records` with a floor at `batch_size`.
- File indexing: deterministic path ordering + deterministic index permutation for paging.
- Source ordering: round-robin by source, deterministic within-source ordering by seed/epoch.
- Splits: labels are deterministic from `record_id + seed + ratios`; split APIs enforce `allowed_splits`.
- Coverage caveat: if `len_hint` drifts mid-epoch in streaming backends, strict single-pass coverage is not guaranteed.
- Weights: recipe/source/chunk weights affect scaling, not deterministic ordering.
- Scale note: full scan/sort/index rebuild cost grows roughly linearly with file count and path bytes.
- Order note: index batching preserves permutation order; chunked index reads do not remove deterministic shuffling.
- Manual epoch control: `sampler.set_epoch(n)` resets per-source cursors and reshuffles deterministically for that epoch.
- Persisted state scope: epoch tracking is split-aware, but sampler/source cursors + RNG/round-robin state are persisted per store file.
- Triplet recipe behavior: per-source recipes are scanned from per-source round-robin hints until a match is found.
- Pair batches: derived from triplets and follow the same source/recipe selection behavior.
- Text recipes: follow per-source behavior when provided; otherwise config recipes are used.
- Oversampling: when sources run dry, cached records may be reused (no global no-repeat guarantee).
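The deterministic per-epoch reshuffle described above can be sketched as a seeded Fisher-Yates permutation keyed by (seed, epoch). The RNG (SplitMix64) and key derivation here are illustrative, not the crate's actual state layout:

```rust
/// Tiny deterministic RNG (SplitMix64) so the shuffle is reproducible.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Deterministic Fisher-Yates shuffle keyed by (seed, epoch): the same
/// key always produces the same within-source order, and bumping the
/// epoch yields a fresh permutation.
fn epoch_order(len: usize, seed: u64, epoch: u64) -> Vec<usize> {
    let mut state = seed ^ epoch.wrapping_mul(0xA076_1D64_78BD_642F);
    let mut order: Vec<usize> = (0..len).collect();
    for i in (1..len).rev() {
        let j = (splitmix64(&mut state) % (i as u64 + 1)) as usize;
        order.swap(i, j);
    }
    order
}

fn main() {
    // Same (seed, epoch) -> same order; set_epoch-style control just
    // swaps in a new epoch key.
    assert_eq!(epoch_order(8, 42, 0), epoch_order(8, 42, 0));
    println!("epoch 0: {:?}", epoch_order(8, 42, 0));
    println!("epoch 1: {:?}", epoch_order(8, 42, 1));
}
```

Because the order is a pure function of (length, seed, epoch), replaying an epoch after a restart needs only those three values, not a saved shuffle.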
New-source implementation pattern
For any new backend (file/API/DB/stream), centralize backend configuration/state access in one helper reused by both refresh(...) and reported_record_count().
Why this matters: capacity estimates and runtime sampling stay aligned only when both methods represent the same logical corpus slice.
File-backed pattern:
If a source emits sequential IDs, implement indexable paging (IndexableSource + IndexablePager or IndexableAdapter) to avoid time-ordered ingestion bias.
Example hash-sorted refresh skeleton:
```rust
use chrono::Utc;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

use triplets::DataRecord;
use triplets::SamplerError;
// (the refresh body of the skeleton is elided here)
```
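To show the intent of hash-sorted refresh (decorrelating ingestion order from insertion time), here is a self-contained sketch that orders record IDs by their hash rather than their arrival order. It omits the skeleton's real types (`DataRecord`, `SamplerError`), and `DefaultHasher` is illustrative; its output is not guaranteed stable across Rust releases:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn hash_key(id: &str, seed: u64) -> u64 {
    let mut h = DefaultHasher::new();
    seed.hash(&mut h);
    id.hash(&mut h);
    h.finish()
}

/// Order records by hash(seed, id) instead of arrival order, so an
/// append-only backend does not bias ingestion toward old records.
fn hash_sorted(mut ids: Vec<&str>, seed: u64) -> Vec<&str> {
    ids.sort_by_key(|id| hash_key(id, seed));
    ids
}

fn main() {
    let arrival = vec!["r1", "r2", "r3", "r4"];
    let ordered = hash_sorted(arrival.clone(), 7);
    // Deterministic for a fixed seed, and a permutation of the input.
    assert_eq!(ordered, hash_sorted(arrival.clone(), 7));
    assert_eq!(ordered.len(), arrival.len());
    println!("{ordered:?}");
}
```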
Capacity estimates
The estimate helpers compute metadata-only approximations from source-reported counts and recipe structure.
- They do not call source refresh.
- They are floor-like approximations for real chunked training.
- Effective triplet estimates use bounded assumptions (positives/negatives per anchor).
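A worked illustration of a floor-like, metadata-only estimate: given per-source reported counts and bounded positives/negatives per anchor, the estimate is simple arithmetic with no source refresh. The function and formula here are illustrative, not the crate's exact math:

```rust
/// Metadata-only triplet capacity estimate: no refresh, just
/// source-reported counts and bounded per-anchor assumptions.
fn estimate_triplets(
    reported_counts: &[u64],
    positives_per_anchor: u64,
    negatives_per_anchor: u64,
) -> u64 {
    let anchors: u64 = reported_counts.iter().sum();
    anchors * positives_per_anchor * negatives_per_anchor
}

fn main() {
    // Two sources report 1_000 and 250 records; assume at most
    // 2 positives and 4 negatives per anchor.
    let estimate = estimate_triplets(&[1_000, 250], 2, 4);
    assert_eq!(estimate, 10_000);
    println!("estimated triplets: {estimate}");
}
```

Because such estimates never touch the sources, they are cheap enough to run before training, but they can diverge from real chunked sampling whenever reported counts drift.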
Potential future directions (optional)
These are ideas, not commitments.
- Add more backend adapters in downstream crates (APIs, DBs, manifests, streams)
- Improve strict-coverage options for drifting/streaming corpora
- Add optional split-keyed sampler cursor state in a single store file
- Extend observability hooks for ingestion latency/skew/error diagnostics
License
triplets is primarily distributed under the terms of both the MIT license and the Apache License (Version 2.0).
See LICENSE-APACHE and LICENSE-MIT for details.