
Module shard_reader


Minimal tokenized-shard reader for MODEL-2 pretrain MVP (task #111).

Reads a directory of .bin files containing little-endian u32 tokens, chunks them into sequences of seq_length + 1 tokens each, and yields LMBatches of batch_size sequences. It applies no licensing filter, no MinHash dedup, and no PII scrub; those belong to apr-corpus-ingest run.
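The decode, chunk, and batch steps described above can be sketched roughly as follows. This is a minimal illustration, not this module's actual implementation: the function names `decode_tokens`, `chunk_sequences`, and `batch_sequences` are hypothetical, batches are represented as plain nested `Vec`s rather than `LMBatch`, and dropping trailing partial data at every stage is an assumption.

```rust
/// Decode a byte buffer as little-endian u32 tokens.
/// Assumption: trailing bytes that do not fill a whole u32 are ignored.
fn decode_tokens(bytes: &[u8]) -> Vec<u32> {
    bytes
        .chunks_exact(4)
        .map(|b| u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect()
}

/// Split tokens into non-overlapping sequences of `seq_length + 1` tokens.
/// Assumption: a trailing partial sequence is dropped.
fn chunk_sequences(tokens: &[u32], seq_length: usize) -> Vec<Vec<u32>> {
    tokens
        .chunks_exact(seq_length + 1)
        .map(|c| c.to_vec())
        .collect()
}

/// Group sequences into batches of `batch_size` sequences.
/// Assumption: a trailing partial batch is dropped.
/// Panics if `batch_size` is 0, because `chunks_exact` rejects a zero size.
fn batch_sequences(seqs: &[Vec<u32>], batch_size: usize) -> Vec<Vec<Vec<u32>>> {
    seqs.chunks_exact(batch_size).map(|b| b.to_vec()).collect()
}
```

In use, the bytes of one shard (for example from `std::fs::read`) would be passed to `decode_tokens`, and the result fed through `chunk_sequences` and `batch_sequences`. One plausible reason for the extra token per sequence is that inputs and next-token targets can then be taken as two views of the same sequence offset by one position, though the source does not state this.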

Contract: contracts/dataset-thestack-python-v1.yaml (shard format).

Structs

ShardBatchIter
Streaming iterator over LMBatches produced from a directory of .bin token shards (little-endian u32).