reshard-tokenized
A Rust CLI that merges tokenized shard files from a directory tree into one or more output shards:
- .npy files are concatenated per shard
- .csv.gz metadata rows are merged with remapped offsets
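The offset remapping can be pictured with a minimal sketch. It is not taken from the tool's source: the `Row` and `Shard` types, the `start`/`end`/`id` fields, and the `merge_rows` function are all hypothetical, and it assumes each metadata row stores token offsets relative to the start of its own shard. Under that assumption, concatenating shards means shifting every row by the total token count of the shards that came before it.

```rust
/// One metadata row. The `start`/`end` token offsets and the `id` field
/// are an assumed layout, not the tool's confirmed CSV schema.
#[derive(Debug, Clone, PartialEq)]
struct Row {
    start: u64,
    end: u64,
    id: String,
}

/// Hypothetical in-memory view of one input shard: the number of tokens
/// in its .npy array plus the rows from its .csv.gz file.
struct Shard {
    num_tokens: u64,
    rows: Vec<Row>,
}

/// Merge rows from consecutive shards, shifting each shard's offsets by
/// the number of tokens that precede it in the concatenated array.
fn merge_rows(shards: &[Shard]) -> Vec<Row> {
    let mut merged = Vec::new();
    let mut base = 0u64; // tokens already written to the output array
    for shard in shards {
        for row in &shard.rows {
            merged.push(Row {
                start: row.start + base,
                end: row.end + base,
                id: row.id.clone(),
            });
        }
        base += shard.num_tokens;
    }
    merged
}
```

With a 10-token shard followed by a 5-token shard, a row covering 0..3 in the second shard would come out as 10..13 in the merged metadata.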
Prerequisites
- Rust toolchain (stable)
- Cargo (included with Rust)
Build
Run
CLI help:
Test
Integration tests use fixtures in tests/data/tokenized_input.
Output
- If --num-files 1, output is written as <output-path>.npy and <output-path>.csv.gz
- If --num-files > 1, output is written as:
  - <output-path>/00000000.npy, <output-path>/00000000.csv.gz
  - <output-path>/00000001.npy, <output-path>/00000001.csv.gz
  - ...
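
The naming rule above can be sketched as a small helper. This is an illustration only: the function name `output_paths` is hypothetical and does not come from the tool's source. It assumes shard indices are zero-padded to eight digits, as the listed file names suggest.

```rust
use std::path::{Path, PathBuf};

/// Return the (.npy, .csv.gz) path pair for each output shard.
/// With one file, the suffixes are appended to `output_path` itself;
/// otherwise `output_path` is treated as a directory of numbered shards.
/// A `num_files` of 0 yields an empty list.
fn output_paths(output_path: &Path, num_files: usize) -> Vec<(PathBuf, PathBuf)> {
    if num_files == 1 {
        // Append rather than use `with_extension`, which would replace
        // any extension already present in the output path.
        let mut npy = output_path.as_os_str().to_os_string();
        npy.push(".npy");
        let mut csv = output_path.as_os_str().to_os_string();
        csv.push(".csv.gz");
        return vec![(PathBuf::from(npy), PathBuf::from(csv))];
    }
    (0..num_files)
        .map(|i| {
            (
                output_path.join(format!("{:08}.npy", i)),
                output_path.join(format!("{:08}.csv.gz", i)),
            )
        })
        .collect()
}
```

Calling `output_paths(Path::new("out"), 2)` would give the pairs for `out/00000000` and `out/00000001`.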