# triplets

WORK IN PROGRESS.

Composable data sampling primitives for deterministic multi-source ML/AI training-data orchestration.

triplets is a reusable core for ML/AI training-data orchestration. It provides sampler primitives, split/state persistence, chunking and weighting mechanics, and source abstractions (`DataSource`, `DataRecord`) without tying behavior to proprietary corpora.
CI is configured to run tests/linting on macOS, Linux, and Windows.
## High-level features

- Automatic deterministic splits (train/validation/test) from record IDs + seed.
- Runtime batch sampling via `next_triplet_batch`, `next_pair_batch`, and `next_text_batch`.
- Recipe-driven sample construction for triplet/pair/text generation (anchor/positive/negative selectors).
- Weight-aware sampling controls across source weights, recipe weights, and chunk trust/quality weighting.
- Resume support via `persist_state()` and split-store persistence.
- Source-agnostic backends (`DataSource` or `IndexableSource` + `IndexableAdapter`).
- Supply-chain style orchestration (core layer): multi-source intake (`refresh`) with per-call parallel ingest, optional per-source weighting, staged buffering, deterministic split routing, and batch assembly into train-ready outputs.
- Bounded ingestion windows instead of loading full corpora into memory.
- Per-call source threading: during refresh, each source is fetched on its own short-lived thread, then merged deterministically for batch assembly.
- Streaming-friendly: sources can be finite or unbounded.
This crate does not perform semantic mining/retrieval scoring by itself; instead, it gives you deterministic, metadata-driven sampling primitives you can feed into your downstream mining/retrieval stack.
## Metadata-driven sampling flow

Use triplets to build deterministic training batches that carry metadata context:

- Put structural tags in `DataRecord.taxonomy` (source/date/category/etc.) for filtering and analysis.
- Use recipes/selectors to choose which sections become anchor/positive/negative text.
- Attach optional KVP metadata prefixes (below) so sampled text can include lightweight context headers.
- Keep split assignment deterministic while changing recipe or weighting behavior at runtime.
This gives you metadata-aware sampling orchestration, while semantic retrieval/mining logic stays in your downstream pipeline.
## KVP data decorator

- Each `DataRecord` can carry an optional `meta_prefix` sampler (`KvpPrefixSampler`).
- At sample time, the sampler can prepend a header line to chunk text, formatted like: `meta: key=value | key2=value2`.
- `KvpField` supports multiple value renderings per key and an optional per-field presence probability.
- `KvpPrefixSampler` supports variant selection and overall dropout (emit the prefix sometimes, or always).
- This is designed to give the model useful context signals (date/source/category/etc.) without making a single rigid header pattern easy to memorize.
- Multi-render values, per-field presence control, field-order variation, and prefix dropout reduce shortcut learning and encourage reliance on the underlying content.
- KVP prefixes decorate sampled text; they do not change deterministic split assignment.
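To make the header shape concrete, here is a minimal standalone sketch. `render_prefix` is a hypothetical stand-in for illustration only, not the crate's `KvpPrefixSampler` API:

```rust
// Illustrative only: builds a header in the `meta: key=value | key2=value2`
// shape described above. The real sampler adds variant selection, per-field
// presence probability, and dropout on top of this basic formatting.
fn render_prefix(fields: &[(&str, &str)]) -> String {
    let parts: Vec<String> = fields.iter().map(|(k, v)| format!("{k}={v}")).collect();
    format!("meta: {}", parts.join(" | "))
}

fn main() {
    let header = render_prefix(&[("date", "2024-05-01"), ("source", "news")]);
    // The decorated chunk would start with this line, followed by the content.
    println!("{header}"); // meta: date=2024-05-01 | source=news
}
```

In the real decorator, dropout means this line is sometimes omitted entirely, so the model cannot rely on its presence.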
## Getting started

Add `triplets` as a dependency of a downstream crate (for example with `cargo add triplets`). The included examples can be run from this repository for exploration and contributor workflows, and contributors can run the usual Cargo checks during development.

Minimal shape:

- Implement one or more `DataSource` backends.
- Create `SamplerConfig` (chunking, recipes, split policy).
- Open a split store (`DeterministicSplitStore` or `FileSplitStore`).
- Construct `PairSampler` and register sources.
- Call one of the batch APIs: `next_triplet_batch(split)`, `next_pair_batch(split)`, or `next_text_batch(split)`.
- Call `persist_state()` when you want restart-resume behavior.
## Examples

From the triplets crate, the bundled example can be run to:

- sample triplet batches
- inspect CLI flags
- run metadata-only capacity estimation

Source roots can be overridden with repeatable flags.
### Split-store path configuration

The `multi_source_demo` example persists sampler/split state by default to `.sampler_store/split_store.bin`.

You can override the persistence location with either:

- `--split-store-path <FILE>` for an explicit file path
- `--split-store-dir <DIR>` to keep the filename `split_store.bin` in a custom directory
## Usage flow

Short version:

- Call `sampler.next_triplet_batch(split)`, `sampler.next_pair_batch(split)`, or `sampler.next_text_batch(split)` to sample batches (ingestion happens automatically).
- Call `sampler.persist_state()` when you want restart-resume behavior.
- Optionally call `sampler.set_epoch(n)` for explicit epoch control.
Step-by-step:

- Build config + open the split store.
- Register sources.
- Call one of `sampler.next_triplet_batch(split)`, `sampler.next_pair_batch(split)`, or `sampler.next_text_batch(split)`.
- Call `sampler.persist_state()` when you want to write persisted sampler/split state (typically at the end of an epoch or at explicit checkpoint boundaries). Do not call this every step: very frequent writes can create high I/O overhead and, at very large write counts (for example, tens of millions), can also slow split-store initialization.
- Optionally call `sampler.set_epoch(n)` for explicit epoch replay/order.
Operational notes:

- File-backed indexing is rebuilt per process/run and stored in an OS temp-backed index store.
- Persisting sampler/split state is explicit and manual.
- One split-store file shares sampler/source cursor + RNG state unless you use separate store files.
- Batch calls are thread-safe but serialized; refresh work within a call can be parallelized per source.
- Source cursors advance independently per source, so one source can continue making progress even if another source is sparse or slower.
- Refresh concurrency is per call: source refreshes run in parallel for that call, then the sampler joins all refresh threads before merging buffers (not an always-on per-source background ingest loop).
- Prefetchers smooth latency by filling bounded queues from the existing batch APIs (`next_triplet_batch`, `next_pair_batch`, `next_text_batch`).
- New data from streaming sources is pulled in on the next batch call.
- `sampler.persist_state()` is manual; skipping it means no resume state after restart.
- `sampler.set_epoch(n)` is an advanced override and is not required for normal resume behavior.
- `IngestionManager::source_refresh_stats()` exposes per-source refresh duration/records/throughput/errors.
- `metrics::source_skew` summarizes per-source sample imbalance for a batch.
Example:
```rust
use std::sync::Arc;
use triplets::/* sampler, config, and split-store types */;

# let split = SplitRatios { /* ... */ };
# let store = FileSplitStore::new(/* ... */);
# let config = SamplerConfig::default();
let sampler = Arc::new(PairSampler::new(/* config, store, ... */));
// register sources...
let prefetcher = sampler.clone().prefetch_triplet_batches(/* ... */);
let batch = prefetcher.next().unwrap();
let _ = batch;
```
- For per-call source weighting, use `next_triplet_batch_with_weights(...)`, `next_pair_batch_with_weights(...)`, or `next_text_batch_with_weights(...)`.
- Missing source ids default to `1.0`; `0.0` disables a source for that call.
Example (different source mix across consecutive batches):
```rust
use std::collections::HashMap;
use std::sync::Arc;
use triplets::/* sampler, config, and split-store types */;

# let split = SplitRatios { /* ... */ };
# let store = FileSplitStore::new(/* ... */);
# let config = SamplerConfig::default();
# let sampler = PairSampler::new(/* ... */);
let mut weights_a = HashMap::new();
weights_a.insert(/* source id */, /* weight */);
weights_a.insert(/* source id */, /* weight */);
let mut weights_b = HashMap::new();
weights_b.insert(/* source id */, /* weight */);
weights_b.insert(/* source id */, /* weight */);
let batch_a = sampler
    .next_triplet_batch_with_weights(/* split, &weights_a, ... */)
    .unwrap();
let batch_b = sampler
    .next_triplet_batch_with_weights(/* split, &weights_b, ... */)
    .unwrap();
let _ = (batch_a, batch_b);
```
- Production readiness note: if `len_hint` drifts in streaming/append-only sources, epoch order/coverage can repeat or skip records within an epoch, even though split assignment remains deterministic.
## Sampling behavior (current)

This reflects the built-in file-corpus helpers (`FileCorpusIndex`) used by filesystem-backed sources.

- Ingestion: `next_triplet_batch(split)`, `next_pair_batch(split)`, and `next_text_batch(split)` trigger refresh; per-source buffers refill when empty (or on force refresh).
- Memory bound: refresh/cache limits are bounded by `ingestion_max_records` with a floor at `batch_size`.
- File indexing: deterministic path ordering + deterministic index permutation for paging.
- Source ordering: round-robin by source, deterministic within-source ordering by seed/epoch.
- Splits: labels are deterministic from `record_id + seed + ratios`; split APIs enforce `allowed_splits`.
- Coverage caveat: if `len_hint` drifts mid-epoch in streaming backends, strict single-pass coverage is not guaranteed.
- Weights: recipe/source/chunk weights affect scaling, not deterministic ordering.
- Scale note: full scan/sort/index rebuild cost grows roughly linearly with file count and path bytes.
- Order note: index batching preserves permutation order; chunked index reads do not remove deterministic shuffling.
- Manual epoch control: `sampler.set_epoch(n)` resets per-source cursors and reshuffles deterministically for that epoch.
- Persisted state scope: epoch tracking is split-aware, but sampler/source cursors + RNG/round-robin state are persisted per store file.
- Triplet recipe behavior: per-source recipes are scanned from per-source round-robin hints until a match is found.
- Pair batches: derived from triplets and follow the same source/recipe selection behavior.
- Text recipes: follow per-source behavior when provided; otherwise config recipes are used.
- Oversampling: when sources run dry, cached records may be reused (no global no-repeat guarantee).
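The split determinism noted above can be sketched in isolation. This toy `split_label` is hypothetical (the crate's hashing scheme may differ); it only illustrates that the label is a pure function of record id, seed, and ratios:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy split labeling: hash (record_id, seed) to a uniform-ish value in
// [0, 1) and bucket it by the configured ratios. Note DefaultHasher is
// only stable within one build of the standard library; a production
// scheme would use an explicitly versioned hash.
fn split_label(record_id: &str, seed: u64, train: f64, validation: f64) -> &'static str {
    let mut h = DefaultHasher::new();
    record_id.hash(&mut h);
    seed.hash(&mut h);
    let u = (h.finish() % 10_000) as f64 / 10_000.0;
    if u < train {
        "train"
    } else if u < train + validation {
        "validation"
    } else {
        "test"
    }
}

fn main() {
    // Same inputs always map to the same split, independent of sampling order.
    let a = split_label("rec-001", 7, 0.8, 0.1);
    let b = split_label("rec-001", 7, 0.8, 0.1);
    assert_eq!(a, b);
    println!("{a}");
}
```

This is why recipe and weight changes can vary freely at runtime without moving any record across splits.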
## New-source implementation pattern

For any new backend (file/API/DB/stream), centralize backend configuration/state access in one helper reused by both `refresh(...)` and `reported_record_count()`.

Why this matters: capacity estimates and runtime sampling stay aligned only when both methods represent the same logical corpus slice.
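A minimal standalone sketch of that pattern (all names here are invented for illustration): one helper defines the corpus slice, and both the ingest path and the count path go through it, so they cannot diverge:

```rust
// Hypothetical backend: `active_window()` is the single source of truth
// for "which records exist right now", reused by both methods below.
struct MyBackend {
    rows: Vec<String>,
    max_window: usize, // bounded ingestion window
}

impl MyBackend {
    fn active_window(&self) -> &[String] {
        let n = self.rows.len().min(self.max_window);
        &self.rows[..n]
    }

    // Stand-in for refresh(...): ingest the current window.
    fn refresh(&self) -> Vec<String> {
        self.active_window().to_vec()
    }

    // Stand-in for reported_record_count(): report the same window.
    fn reported_record_count(&self) -> usize {
        self.active_window().len()
    }
}

fn main() {
    let b = MyBackend {
        rows: (0..10).map(|i| format!("r{i}")).collect(),
        max_window: 4,
    };
    // Capacity estimates and runtime sampling see the same corpus slice.
    assert_eq!(b.refresh().len(), b.reported_record_count());
    println!("{}", b.reported_record_count()); // 4
}
```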
File-backed pattern:

If your records are time-ordered (oldest → newest), use these APIs:

- `IndexableSource` (you provide `len_hint()` + `record_at(idx)`).
- `IndexableAdapter` (easiest: turns your `IndexableSource` into a `DataSource`).
- `IndexablePager` (use directly only if you are writing a custom `refresh(...)`).

That is the built-in path for shuffled paging + cursor resume.
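The shuffled-paging idea can be illustrated standalone. This toy `permutation` helper is not `IndexablePager` itself (the shuffle here is an assumed xorshift); it shows why persisting only a cursor is enough to resume — the order is a pure function of the seed:

```rust
// Deterministic Fisher-Yates shuffle of 0..len driven by a tiny
// xorshift generator, so the same seed always yields the same order.
fn permutation(len: usize, seed: u64) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..len).collect();
    let mut s = seed.wrapping_mul(0x9E37_79B9_7F4A_7C15) | 1;
    for i in (1..len).rev() {
        s ^= s << 13;
        s ^= s >> 7;
        s ^= s << 17;
        idx.swap(i, (s as usize) % (i + 1));
    }
    idx
}

fn main() {
    let order = permutation(8, 42);
    let cursor = 3; // e.g. persisted after three records were sampled
    // After a restart, rebuilding the same permutation and skipping to the
    // persisted cursor resumes in exactly the same order.
    let resumed = permutation(8, 42)[cursor..].to_vec();
    assert_eq!(resumed, order[cursor..].to_vec());
    println!("{resumed:?}");
}
```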
Helper-based path (uses the APIs above):

```rust
// imports and the IndexableSource impl are elided here;
// register the adapted source as a normal DataSource:
// sampler.register_source(Box::new(IndexableAdapter::new(MyIndexableSource { total_records })));
```
Manual path (does NOT use `IndexableSource`/`IndexableAdapter` directly):

```rust
use chrono::Utc;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use triplets::DataRecord;
use triplets::SamplerError;
// ...the custom DataSource impl (with its own refresh(...)) is elided...
```
## Capacity estimates
The estimate helpers compute metadata-only approximations from source-reported counts and recipe structure.
- They do not call source refresh.
- They are floor-like approximations for real chunked training.
- Effective triplet estimates use bounded assumptions (positives/negatives per anchor).
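As a rough standalone illustration of such floor-like arithmetic (the formula and names here are assumed for the sketch, not the crate's exact helpers):

```rust
// Metadata-only estimate: multiply source-reported anchor counts by
// bounded per-anchor assumptions. No source refresh is involved, and
// capping per-anchor fan-out keeps the estimate floor-like.
fn estimate_triplets(
    anchor_records: u64,
    positives_per_anchor: u64,
    negatives_per_anchor: u64,
    per_anchor_bound: u64,
) -> u64 {
    anchor_records
        * positives_per_anchor.min(per_anchor_bound)
        * negatives_per_anchor.min(per_anchor_bound)
}

fn main() {
    // 1_000 anchors, up to 5 positives and 40 negatives each, bounded at 8:
    let est = estimate_triplets(1_000, 5, 40, 8);
    assert_eq!(est, 40_000);
    println!("{est}"); // 40000
}
```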
## Potential future directions (optional)
These are ideas, not commitments.
- Add more backend adapters in downstream crates (APIs, DBs, manifests, streams)
- Improve strict-coverage options for drifting/streaming corpora
- Add optional split-keyed sampler cursor state in a single store file
- Extend observability hooks for ingestion latency/skew/error diagnostics
## License
triplets is primarily distributed under the terms of both the MIT license and the Apache License (Version 2.0).
See LICENSE-APACHE and LICENSE-MIT for details.