Minimal tokenized-shard reader for MODEL-2 pretrain MVP (task #111).
Reads a directory of .bin files containing little-endian u32 tokens,
chunks them into sequences of seq_length + 1 tokens, and yields LMBatches of
batch_size sequences each. No licensing filter, no MinHash dedup, no PII
scrub; those belong to apr-corpus-ingest run.
Contract: contracts/dataset-thestack-python-v1.yaml (shard format).
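The pipeline described above (decode little-endian u32 tokens, chunk into seq_length + 1 windows, group into batches) can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the crate's actual implementation: the helper names `decode_tokens`, `chunk_sequences`, `make_batches`, and `read_shard_dir` are hypothetical, and the sketch assumes non-overlapping windows, sorted shard order, tokens concatenated across shard boundaries, and that trailing partial sequences and batches are dropped. The real reader may differ on any of these points.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Decode a byte buffer as little-endian u32 tokens.
/// Assumption: trailing bytes that do not fill a whole token are ignored.
fn decode_tokens(bytes: &[u8]) -> Vec<u32> {
    bytes
        .chunks_exact(4)
        .map(|b| u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect()
}

/// Split tokens into non-overlapping sequences of `seq_length + 1` tokens.
/// Assumption: a final partial sequence is dropped.
fn chunk_sequences(tokens: &[u32], seq_length: usize) -> Vec<Vec<u32>> {
    tokens
        .chunks_exact(seq_length + 1)
        .map(|c| c.to_vec())
        .collect()
}

/// Group sequences into batches of exactly `batch_size` sequences.
/// Assumption: a final partial batch is dropped. Panics if `batch_size` is 0.
fn make_batches(seqs: Vec<Vec<u32>>, batch_size: usize) -> Vec<Vec<Vec<u32>>> {
    seqs.chunks_exact(batch_size).map(|b| b.to_vec()).collect()
}

/// Read every `.bin` file in `dir` (sorted by path) and concatenate the tokens.
/// Assumption: this sketch loads everything into memory rather than streaming.
fn read_shard_dir(dir: &Path) -> io::Result<Vec<u32>> {
    let mut paths: Vec<_> = fs::read_dir(dir)?
        .filter_map(|entry| entry.ok().map(|e| e.path()))
        .filter(|p| p.extension().map_or(false, |ext| ext == "bin"))
        .collect();
    paths.sort();

    let mut tokens = Vec::new();
    for path in paths {
        tokens.extend(decode_tokens(&fs::read(path)?));
    }
    Ok(tokens)
}
```

The extra token in each seq_length + 1 window is what lets a language-model batch be split into inputs (the first seq_length tokens) and next-token targets (the last seq_length tokens) without reading across window boundaries.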
Structs§
- ShardBatchIter - Streaming iterator over LMBatches produced from a directory of .bin token shards (little-endian u32).