pub enum TokenizeCommands {
Plan {
data: PathBuf,
vocab_size: usize,
algorithm: String,
output: PathBuf,
format: String,
},
Apply {
data: PathBuf,
vocab_size: usize,
algorithm: String,
output: PathBuf,
max_lines: usize,
},
Train {
corpus: PathBuf,
vocab_size: usize,
min_frequency: usize,
output: PathBuf,
normalization: String,
},
ImportHf {
input: PathBuf,
output: PathBuf,
include_added_tokens: bool,
},
EncodeCorpus {
corpus: PathBuf,
tokenizer: PathBuf,
output: PathBuf,
shard_tokens: usize,
content_field: String,
normalization: String,
eos_policy: String,
},
}Expand description
Tokenizer training pipeline subcommands (forjar-style plan/apply).
Thin CLI wrappers around aprender’s BPE training infrastructure. Trains a BPE vocabulary from a text corpus for use in model training.
Variants§
Plan
Validate inputs and estimate tokenizer training time/resources.
Checks that the input corpus exists, counts lines/bytes, estimates vocabulary coverage, and reports expected training time. Outputs a serializable plan manifest (text, JSON, or YAML).
Analogous to forjar plan — shows what will happen before committing.
Fields
Apply
Train a tokenizer on the corpus.
Reads the input corpus, trains a BPE/WordPiece/Unigram tokenizer, and writes vocab.json + merges.txt to the output directory.
Analogous to forjar apply — commits resources and executes the plan.
Fields
Train
Train BPE on a JSONL corpus per contracts/tokenizer-bpe-v1.yaml (MODEL-2).
Walks --corpus (file or directory of .jsonl files), extracting the
content field from each line, applies --normalization (NFC default),
and trains a BPE tokenizer with the target vocab size. Writes
vocab.json (token→id) and merges.txt (one a b pair per line, in
merge order) to --output.
Fields
corpus: PathBufPath to corpus: a .jsonl file or a directory containing .jsonl files.
Each line must be a JSON object with a content field.
vocab_size: usizeTarget vocabulary size. Default 50_257 matches GPT-2 convention (50_000 BPE merges + 256 byte-level fallback tokens + 1 sentinel) and the MODEL-2 albor tokenizer contract (tokenizer-bpe-v1 v1.2.0).
ImportHf
Import a HuggingFace tokenizer.json into aprender’s two-file
vocab.json + merges.txt layout per
contracts/apr-cli-tokenize-import-hf-v1.yaml (§50.4 step 5g.0).
Reads <INPUT> (a HF tokenizer.json with model.type == "BPE"),
extracts model.vocab → <OUTPUT>/vocab.json, model.merges →
<OUTPUT>/merges.txt (one space-separated merge per line), and
writes <OUTPUT>/manifest.json with extraction provenance
(source path, sha256, vocab_size, merges_count, timestamp).
Non-BPE inputs (Unigram, WordPiece) are rejected fail-fast with a clear error citing the contract id.
Unblocks fine-tuning from public HF checkpoints (Qwen2.5/Llama2/
Mistral) which distribute as a single tokenizer.json. The output
dir is consumable by apr tokenize encode-corpus --tokenizer <DIR>
and apr pretrain --tokenizer <DIR> without modification.
Fields
EncodeCorpus
Encode a JSONL corpus into .bin shards per contracts/pretokenize-bin-v1.yaml.
Loads a trained BPE tokenizer (vocab.json + merges.txt) from --tokenizer,
reads --corpus (file or directory of .jsonl files), encodes the
--content-field of each line to u32 tokens, and writes
shard-NNNN.bin files (flat little-endian u32 streams) into --output.
The output format is precisely what ShardBatchIter (aprender-train)
expects at MODEL-2 pretrain read time.
Root-cause fix for the pretokenize-to-bin gap documented in memory/project_shard_reader_bin_format.md — replaces a Python shim that was flagged as MUDA on 2026-04-19.
Fields
Trait Implementations§
Source§impl Debug for TokenizeCommands
impl Debug for TokenizeCommands
Source§impl FromArgMatches for TokenizeCommands
impl FromArgMatches for TokenizeCommands
Source§fn from_arg_matches(__clap_arg_matches: &ArgMatches) -> Result<Self, Error>
fn from_arg_matches(__clap_arg_matches: &ArgMatches) -> Result<Self, Error>
Source§fn from_arg_matches_mut(
__clap_arg_matches: &mut ArgMatches,
) -> Result<Self, Error>
fn from_arg_matches_mut( __clap_arg_matches: &mut ArgMatches, ) -> Result<Self, Error>
Source§fn update_from_arg_matches(
&mut self,
__clap_arg_matches: &ArgMatches,
) -> Result<(), Error>
fn update_from_arg_matches( &mut self, __clap_arg_matches: &ArgMatches, ) -> Result<(), Error>
ArgMatches to self.Source§fn update_from_arg_matches_mut<'b>(
&mut self,
__clap_arg_matches: &mut ArgMatches,
) -> Result<(), Error>
fn update_from_arg_matches_mut<'b>( &mut self, __clap_arg_matches: &mut ArgMatches, ) -> Result<(), Error>
ArgMatches to self.Source§impl Subcommand for TokenizeCommands
impl Subcommand for TokenizeCommands
Source§fn augment_subcommands<'b>(__clap_app: Command) -> Command
fn augment_subcommands<'b>(__clap_app: Command) -> Command
Source§fn augment_subcommands_for_update<'b>(__clap_app: Command) -> Command
fn augment_subcommands_for_update<'b>(__clap_app: Command) -> Command
Command so it can instantiate self via
FromArgMatches::update_from_arg_matches_mut Read moreSource§fn has_subcommand(__clap_name: &str) -> bool
fn has_subcommand(__clap_name: &str) -> bool
Self can parse a specific subcommandAuto Trait Implementations§
impl Freeze for TokenizeCommands
impl RefUnwindSafe for TokenizeCommands
impl Send for TokenizeCommands
impl Sync for TokenizeCommands
impl Unpin for TokenizeCommands
impl UnsafeUnpin for TokenizeCommands
impl UnwindSafe for TokenizeCommands
Blanket Implementations§
Source§impl<T> BorrowMut<T> for Twhere
T: ?Sized,
impl<T> BorrowMut<T> for Twhere
T: ?Sized,
Source§fn borrow_mut(&mut self) -> &mut T
fn borrow_mut(&mut self) -> &mut T
Source§impl<T> FmtForward for T
impl<T> FmtForward for T
Source§fn fmt_binary(self) -> FmtBinary<Self>where
Self: Binary,
fn fmt_binary(self) -> FmtBinary<Self>where
Self: Binary,
self to use its Binary implementation when Debug-formatted.Source§fn fmt_display(self) -> FmtDisplay<Self>where
Self: Display,
fn fmt_display(self) -> FmtDisplay<Self>where
Self: Display,
self to use its Display implementation when
Debug-formatted.Source§fn fmt_lower_exp(self) -> FmtLowerExp<Self>where
Self: LowerExp,
fn fmt_lower_exp(self) -> FmtLowerExp<Self>where
Self: LowerExp,
self to use its LowerExp implementation when
Debug-formatted.Source§fn fmt_lower_hex(self) -> FmtLowerHex<Self>where
Self: LowerHex,
fn fmt_lower_hex(self) -> FmtLowerHex<Self>where
Self: LowerHex,
self to use its LowerHex implementation when
Debug-formatted.Source§fn fmt_octal(self) -> FmtOctal<Self>where
Self: Octal,
fn fmt_octal(self) -> FmtOctal<Self>where
Self: Octal,
self to use its Octal implementation when Debug-formatted.Source§fn fmt_pointer(self) -> FmtPointer<Self>where
Self: Pointer,
fn fmt_pointer(self) -> FmtPointer<Self>where
Self: Pointer,
self to use its Pointer implementation when
Debug-formatted.Source§fn fmt_upper_exp(self) -> FmtUpperExp<Self>where
Self: UpperExp,
fn fmt_upper_exp(self) -> FmtUpperExp<Self>where
Self: UpperExp,
self to use its UpperExp implementation when
Debug-formatted.Source§fn fmt_upper_hex(self) -> FmtUpperHex<Self>where
Self: UpperHex,
fn fmt_upper_hex(self) -> FmtUpperHex<Self>where
Self: UpperHex,
self to use its UpperHex implementation when
Debug-formatted.Source§impl<T> FutureExt for T
impl<T> FutureExt for T
Source§fn with_context(self, otel_cx: Context) -> WithContext<Self>
fn with_context(self, otel_cx: Context) -> WithContext<Self>
Source§fn with_current_context(self) -> WithContext<Self>
fn with_current_context(self) -> WithContext<Self>
Source§impl<T> Instrument for T
impl<T> Instrument for T
Source§fn instrument(self, span: Span) -> Instrumented<Self>
fn instrument(self, span: Span) -> Instrumented<Self>
Source§fn in_current_span(self) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
Source§impl<T> IntoEither for T
impl<T> IntoEither for T
Source§fn into_either(self, into_left: bool) -> Either<Self, Self>
fn into_either(self, into_left: bool) -> Either<Self, Self>
self into a Left variant of Either<Self, Self>
if into_left is true.
Converts self into a Right variant of Either<Self, Self>
otherwise. Read moreSource§fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
self into a Left variant of Either<Self, Self>
if into_left(&self) returns true.
Converts self into a Right variant of Either<Self, Self>
otherwise. Read moreSource§impl<T> IntoRequest<T> for T
impl<T> IntoRequest<T> for T
Source§fn into_request(self) -> Request<T>
fn into_request(self) -> Request<T>
T in a tonic::RequestSource§impl<T> Pipe for Twhere
T: ?Sized,
impl<T> Pipe for Twhere
T: ?Sized,
Source§fn pipe<R>(self, func: impl FnOnce(Self) -> R) -> Rwhere
Self: Sized,
fn pipe<R>(self, func: impl FnOnce(Self) -> R) -> Rwhere
Self: Sized,
Source§fn pipe_ref<'a, R>(&'a self, func: impl FnOnce(&'a Self) -> R) -> Rwhere
R: 'a,
fn pipe_ref<'a, R>(&'a self, func: impl FnOnce(&'a Self) -> R) -> Rwhere
R: 'a,
self and passes that borrow into the pipe function. Read moreSource§fn pipe_ref_mut<'a, R>(&'a mut self, func: impl FnOnce(&'a mut Self) -> R) -> Rwhere
R: 'a,
fn pipe_ref_mut<'a, R>(&'a mut self, func: impl FnOnce(&'a mut Self) -> R) -> Rwhere
R: 'a,
self and passes that borrow into the pipe function. Read moreSource§fn pipe_borrow<'a, B, R>(&'a self, func: impl FnOnce(&'a B) -> R) -> R
fn pipe_borrow<'a, B, R>(&'a self, func: impl FnOnce(&'a B) -> R) -> R
Source§fn pipe_borrow_mut<'a, B, R>(
&'a mut self,
func: impl FnOnce(&'a mut B) -> R,
) -> R
fn pipe_borrow_mut<'a, B, R>( &'a mut self, func: impl FnOnce(&'a mut B) -> R, ) -> R
Source§fn pipe_as_ref<'a, U, R>(&'a self, func: impl FnOnce(&'a U) -> R) -> R
fn pipe_as_ref<'a, U, R>(&'a self, func: impl FnOnce(&'a U) -> R) -> R
self, then passes self.as_ref() into the pipe function.Source§fn pipe_as_mut<'a, U, R>(&'a mut self, func: impl FnOnce(&'a mut U) -> R) -> R
fn pipe_as_mut<'a, U, R>(&'a mut self, func: impl FnOnce(&'a mut U) -> R) -> R
self, then passes self.as_mut() into the pipe
function.Source§fn pipe_deref<'a, T, R>(&'a self, func: impl FnOnce(&'a T) -> R) -> R
fn pipe_deref<'a, T, R>(&'a self, func: impl FnOnce(&'a T) -> R) -> R
self, then passes self.deref() into the pipe function.Source§impl<T> Pointable for T
impl<T> Pointable for T
Source§impl<T> PolicyExt for Twhere
T: ?Sized,
impl<T> PolicyExt for Twhere
T: ?Sized,
Source§impl<T> Tap for T
impl<T> Tap for T
Source§fn tap_borrow<B>(self, func: impl FnOnce(&B)) -> Self
fn tap_borrow<B>(self, func: impl FnOnce(&B)) -> Self
Borrow<B> of a value. Read moreSource§fn tap_borrow_mut<B>(self, func: impl FnOnce(&mut B)) -> Self
fn tap_borrow_mut<B>(self, func: impl FnOnce(&mut B)) -> Self
BorrowMut<B> of a value. Read moreSource§fn tap_ref<R>(self, func: impl FnOnce(&R)) -> Self
fn tap_ref<R>(self, func: impl FnOnce(&R)) -> Self
AsRef<R> view of a value. Read moreSource§fn tap_ref_mut<R>(self, func: impl FnOnce(&mut R)) -> Self
fn tap_ref_mut<R>(self, func: impl FnOnce(&mut R)) -> Self
AsMut<R> view of a value. Read moreSource§fn tap_deref<T>(self, func: impl FnOnce(&T)) -> Self
fn tap_deref<T>(self, func: impl FnOnce(&T)) -> Self
Deref::Target of a value. Read moreSource§fn tap_deref_mut<T>(self, func: impl FnOnce(&mut T)) -> Self
fn tap_deref_mut<T>(self, func: impl FnOnce(&mut T)) -> Self
Deref::Target of a value. Read moreSource§fn tap_dbg(self, func: impl FnOnce(&Self)) -> Self
fn tap_dbg(self, func: impl FnOnce(&Self)) -> Self
.tap() only in debug builds, and is erased in release builds.Source§fn tap_mut_dbg(self, func: impl FnOnce(&mut Self)) -> Self
fn tap_mut_dbg(self, func: impl FnOnce(&mut Self)) -> Self
.tap_mut() only in debug builds, and is erased in release
builds.Source§fn tap_borrow_dbg<B>(self, func: impl FnOnce(&B)) -> Self
fn tap_borrow_dbg<B>(self, func: impl FnOnce(&B)) -> Self
.tap_borrow() only in debug builds, and is erased in release
builds.Source§fn tap_borrow_mut_dbg<B>(self, func: impl FnOnce(&mut B)) -> Self
fn tap_borrow_mut_dbg<B>(self, func: impl FnOnce(&mut B)) -> Self
.tap_borrow_mut() only in debug builds, and is erased in release
builds.Source§fn tap_ref_dbg<R>(self, func: impl FnOnce(&R)) -> Self
fn tap_ref_dbg<R>(self, func: impl FnOnce(&R)) -> Self
.tap_ref() only in debug builds, and is erased in release
builds.Source§fn tap_ref_mut_dbg<R>(self, func: impl FnOnce(&mut R)) -> Self
fn tap_ref_mut_dbg<R>(self, func: impl FnOnce(&mut R)) -> Self
.tap_ref_mut() only in debug builds, and is erased in release
builds.Source§fn tap_deref_dbg<T>(self, func: impl FnOnce(&T)) -> Self
fn tap_deref_dbg<T>(self, func: impl FnOnce(&T)) -> Self
.tap_deref() only in debug builds, and is erased in release
builds.