reshard-tokenized 0.1.0

CLI for merging tokenized shard .npy files and remapping .csv.gz metadata offsets.

reshard-tokenized

A Rust CLI that merges tokenized shard files from a directory tree into one or more output shards:

  • .npy files are concatenated per shard
  • .csv.gz metadata rows are merged with remapped offsets
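The offset remapping can be pictured with a minimal sketch. This is not the crate's actual code: the `Row` struct, its `start`/`end` fields, and the `remap_offsets` function are hypothetical names, and the sketch assumes each metadata row stores a token span relative to the start of its own shard. Under that assumption, merging shifts each shard's rows by the total token count of the shards concatenated before it.

```rust
/// Hypothetical metadata row: a document's token span inside one shard.
/// The real .csv.gz columns may differ; this is an assumption for illustration.
#[derive(Debug, Clone, PartialEq)]
struct Row {
    start: u64,
    end: u64,
}

/// Merge rows from several input shards into one list.
/// Each input is (token count of the shard's .npy, rows of its .csv.gz).
/// Offsets are shifted by the number of tokens in all preceding shards.
fn remap_offsets(shards: &[(u64, Vec<Row>)]) -> Vec<Row> {
    let mut merged = Vec::new();
    let mut base = 0u64; // tokens already written to the merged shard
    for (token_count, rows) in shards {
        for row in rows {
            merged.push(Row {
                start: row.start + base,
                end: row.end + base,
            });
        }
        base += token_count;
    }
    merged
}
```

With two input shards of 10 and 5 tokens, a row spanning 0..5 in the second shard would land at 10..15 in the merged output.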

Prerequisites

  • Rust toolchain (stable)
  • Cargo (included with Rust)

Build

cargo build

Run

cargo run -- \
  --input-path tests/data/tokenized_input \
  --num-files 2 \
  --output-path /tmp/merged

CLI help:

cargo run -- --help

Test

cargo test

Integration tests use fixtures in tests/data/tokenized_input.

Output

  • If --num-files is 1, output is written as <output-path>.npy and <output-path>.csv.gz
  • If --num-files is greater than 1, <output-path> is treated as a directory and output is written as numbered shard pairs:
    • <output-path>/00000000.npy, <output-path>/00000000.csv.gz
    • <output-path>/00000001.npy, <output-path>/00000001.csv.gz
    • ...
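
The naming rule above can be sketched as a small helper. The function name `output_paths` is hypothetical and not taken from the crate; the sketch only assumes that shard indices are zero-padded to eight digits, as the listed file names suggest.

```rust
use std::path::{Path, PathBuf};

/// Return the (.npy, .csv.gz) path pair for every output shard.
/// Illustrative only: the crate's internal naming logic may differ.
fn output_paths(output_path: &Path, num_files: usize) -> Vec<(PathBuf, PathBuf)> {
    if num_files == 1 {
        // Single shard: append the extensions to the output path itself.
        let mut npy = output_path.as_os_str().to_owned();
        npy.push(".npy");
        let mut csv = output_path.as_os_str().to_owned();
        csv.push(".csv.gz");
        vec![(PathBuf::from(npy), PathBuf::from(csv))]
    } else {
        // Several shards: the output path is a directory of numbered files.
        (0..num_files)
            .map(|i| {
                (
                    output_path.join(format!("{:08}.npy", i)),
                    output_path.join(format!("{:08}.csv.gz", i)),
                )
            })
            .collect()
    }
}
```

For the Run example above (--num-files 2, --output-path /tmp/merged), this sketch yields /tmp/merged/00000000.npy and /tmp/merged/00000001.npy together with their .csv.gz counterparts.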