Enum TokenizeCommands 

pub enum TokenizeCommands {
    Plan {
        data: PathBuf,
        vocab_size: usize,
        algorithm: String,
        output: PathBuf,
        format: String,
    },
    Apply {
        data: PathBuf,
        vocab_size: usize,
        algorithm: String,
        output: PathBuf,
        max_lines: usize,
    },
    Train {
        corpus: PathBuf,
        vocab_size: usize,
        min_frequency: usize,
        output: PathBuf,
        normalization: String,
    },
    ImportHf {
        input: PathBuf,
        output: PathBuf,
        include_added_tokens: bool,
    },
    EncodeCorpus {
        corpus: PathBuf,
        tokenizer: PathBuf,
        output: PathBuf,
        shard_tokens: usize,
        content_field: String,
        normalization: String,
        eos_policy: String,
    },
}
Tokenizer training pipeline subcommands (forjar-style plan/apply).

Thin CLI wrappers around aprender’s BPE training infrastructure. Trains a BPE vocabulary from a text corpus for use in model training.

Variants§

Plan

Validate inputs and estimate tokenizer training time/resources.

Checks that the input corpus exists, counts lines/bytes, estimates vocabulary coverage, and reports expected training time. Outputs a serializable plan manifest (text, JSON, or YAML).

Analogous to forjar plan — shows what will happen before committing.

Fields

§data: PathBuf

Path to training corpus (text file, one document per line)

§vocab_size: usize

Target vocabulary size

§algorithm: String

Tokenizer algorithm: bpe, wordpiece, unigram

§output: PathBuf

Output directory for trained tokenizer

§format: String

Output format: text, json, yaml
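The checks described for Plan can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: `CorpusStats`, `corpus_stats`, `validate_algorithm`, and `validate_format` are hypothetical names, and the corpus is passed in memory rather than read from the `data` path.

```rust
/// Hypothetical summary of what a plan step might count (not the crate's type).
#[derive(Debug, PartialEq)]
pub struct CorpusStats {
    pub lines: usize,
    pub bytes: usize,
}

/// Count documents (one per line) and total bytes of an in-memory corpus.
pub fn corpus_stats(corpus: &str) -> CorpusStats {
    CorpusStats {
        lines: corpus.lines().count(),
        bytes: corpus.len(),
    }
}

/// Accept only the algorithm names listed for the `algorithm` field.
pub fn validate_algorithm(name: &str) -> Result<(), String> {
    match name {
        "bpe" | "wordpiece" | "unigram" => Ok(()),
        other => Err(format!("unknown tokenizer algorithm: {}", other)),
    }
}

/// Accept only the output formats listed for the `format` field.
pub fn validate_format(name: &str) -> Result<(), String> {
    match name {
        "text" | "json" | "yaml" => Ok(()),
        other => Err(format!("unknown plan format: {}", other)),
    }
}
```

Validating the string fields before touching the corpus keeps the plan step cheap: a misspelled algorithm or format is reported without reading any data.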

Apply

Train a tokenizer on the corpus.

Reads the input corpus, trains a BPE/WordPiece/Unigram tokenizer, and writes vocab.json + merges.txt to the output directory.

Analogous to forjar apply — commits resources and executes the plan.

Fields

§data: PathBuf

Path to training corpus (text file, one document per line)

§vocab_size: usize

Target vocabulary size

§algorithm: String

Tokenizer algorithm: bpe, wordpiece, unigram

§output: PathBuf

Output directory for trained tokenizer

§max_lines: usize

Maximum number of lines to read from corpus (0 = all)
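The `max_lines` convention (0 means "read the whole corpus") can be expressed as below. `select_lines` is a hypothetical helper working on an in-memory string; the actual command reads from the `data` path and may stream lines instead.

```rust
/// Return the lines to train on; `max_lines == 0` means "no limit".
pub fn select_lines(corpus: &str, max_lines: usize) -> Vec<&str> {
    // Map the sentinel 0 to an effectively unbounded limit.
    let limit = if max_lines == 0 { usize::MAX } else { max_lines };
    corpus.lines().take(limit).collect()
}
```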

Train

Train BPE on a JSONL corpus per contracts/tokenizer-bpe-v1.yaml (MODEL-2).

Walks --corpus (file or directory of .jsonl files), extracting the content field from each line, applies --normalization (NFC default), and trains a BPE tokenizer with the target vocab size. Writes vocab.json (token→id) and merges.txt (one a b pair per line, in merge order) to --output.

Fields

§corpus: PathBuf

Path to corpus: a .jsonl file or a directory containing .jsonl files. Each line must be a JSON object with a content field.

§vocab_size: usize

Target vocabulary size. Default 50_257 matches GPT-2 convention (50_000 BPE merges + 256 byte-level fallback tokens + 1 sentinel) and the MODEL-2 albor tokenizer contract (tokenizer-bpe-v1 v1.2.0).

§min_frequency: usize

Minimum frequency a byte-pair must reach before BPE merges it into a new vocabulary token (honored by entrenar::tokenizer::BPETokenizer per task #103). Pairs below this threshold are left unmerged — contract INV-TOK-002 of contracts/tokenizer-bpe-v1.yaml.

§output: PathBuf

Output directory; will contain vocab.json and merges.txt.

§normalization: String

Unicode normalization form applied to each document before training.
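To make the `min_frequency` rule and the `merges.txt` line format concrete, here is a small sketch of one BPE selection step. It is an illustration under stated assumptions, not the code in `entrenar::tokenizer::BPETokenizer`: the function names are hypothetical, words are pre-split into symbols, and ties are broken by lexicographic order purely for determinism.

```rust
use std::collections::BTreeMap;

/// Count adjacent symbol pairs across all words.
pub fn count_pairs(words: &[Vec<String>]) -> BTreeMap<(String, String), usize> {
    let mut counts = BTreeMap::new();
    for word in words {
        for pair in word.windows(2) {
            *counts
                .entry((pair[0].clone(), pair[1].clone()))
                .or_insert(0) += 1;
        }
    }
    counts
}

/// Pick the most frequent pair, or `None` if no pair reaches `min_frequency`.
/// On a tie the lexicographically smallest pair wins (an assumption made here).
pub fn best_pair(
    counts: &BTreeMap<(String, String), usize>,
    min_frequency: usize,
) -> Option<(String, String)> {
    let mut best: Option<(&(String, String), usize)> = None;
    for (pair, &count) in counts {
        // Pairs below the threshold are never merged.
        if count < min_frequency {
            continue;
        }
        let replace = match best {
            Some((_, best_count)) => count > best_count,
            None => true,
        };
        if replace {
            best = Some((pair, count));
        }
    }
    best.map(|(pair, _)| pair.clone())
}

/// Format one merges.txt line: the two symbols separated by a single space.
pub fn merge_line(pair: &(String, String)) -> String {
    format!("{} {}", pair.0, pair.1)
}
```

A full trainer would repeat this step, replacing each occurrence of the chosen pair with a new symbol, until the target vocabulary size is reached or `best_pair` returns `None`.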

ImportHf

Import a HuggingFace tokenizer.json into aprender’s two-file vocab.json + merges.txt layout per contracts/apr-cli-tokenize-import-hf-v1.yaml (§50.4 step 5g.0).

Reads <INPUT> (a HF tokenizer.json with model.type == "BPE"), extracts model.vocab → <OUTPUT>/vocab.json and model.merges → <OUTPUT>/merges.txt (one space-separated merge per line), and writes <OUTPUT>/manifest.json with extraction provenance (source path, sha256, vocab_size, merges_count, timestamp).

Non-BPE inputs (Unigram, WordPiece) are rejected fail-fast with a clear error citing the contract id.

Unblocks fine-tuning from public HF checkpoints (Qwen2.5/Llama2/Mistral), which ship a single tokenizer.json. The output dir is consumable by apr tokenize encode-corpus --tokenizer <DIR> and apr pretrain --tokenizer <DIR> without modification.

Fields

§input: PathBuf

Path to input HuggingFace tokenizer.json (BPE model required).

§output: PathBuf

Output directory; will contain vocab.json + merges.txt + manifest.json.

§include_added_tokens: bool

Include added_tokens in vocab.json (default: BPE state machine only). Use this when the downstream consumer needs special tokens (e.g., <|im_start|>, <|endoftext|>) materialized in vocab.json itself.
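Two steps of the import can be sketched without a JSON parser: the fail-fast check on model.type and the rendering of merges.txt. Both functions are hypothetical, and the error wording is illustrative. Only the contract id and the "one space-separated merge per line" layout come from the description above. The trailing newline after the last merge is an assumption.

```rust
/// Reject non-BPE models before doing any work.
/// The message text is illustrative; it cites the contract id from the docs.
pub fn require_bpe(model_type: &str) -> Result<(), String> {
    if model_type == "BPE" {
        Ok(())
    } else {
        Err(format!(
            "unsupported model.type {:?}: only BPE is accepted (contract apr-cli-tokenize-import-hf-v1)",
            model_type
        ))
    }
}

/// Render merges as merges.txt content: one space-separated pair per line.
pub fn render_merges(merges: &[(String, String)]) -> String {
    let mut out = String::new();
    for (left, right) in merges {
        out.push_str(left);
        out.push(' ');
        out.push_str(right);
        out.push('\n');
    }
    out
}
```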

EncodeCorpus

Encode a JSONL corpus into .bin shards per contracts/pretokenize-bin-v1.yaml.

Loads a trained BPE tokenizer (vocab.json + merges.txt) from --tokenizer, reads --corpus (file or directory of .jsonl files), encodes the --content-field of each line to u32 tokens, and writes shard-NNNN.bin files (flat little-endian u32 streams) into --output. The output format is precisely what ShardBatchIter (aprender-train) expects at MODEL-2 pretrain read time.

Root-cause fix for the pretokenize-to-bin gap documented in memory/project_shard_reader_bin_format.md — replaces a Python shim that was flagged as MUDA on 2026-04-19.
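The shard payload described above, a flat little-endian u32 stream, is simple enough to sketch directly. The helper names are hypothetical, and reading "flat" as "no header or footer" is an assumption. The authoritative layout is whatever contracts/pretokenize-bin-v1.yaml specifies.

```rust
/// Serialize token ids as little-endian u32 values, 4 bytes per token.
pub fn tokens_to_bytes(tokens: &[u32]) -> Vec<u8> {
    tokens.iter().flat_map(|t| t.to_le_bytes()).collect()
}

/// Parse a shard back into token ids; fails if the length is not a multiple of 4.
pub fn bytes_to_tokens(bytes: &[u8]) -> Result<Vec<u32>, String> {
    if bytes.len() % 4 != 0 {
        return Err(format!(
            "shard length {} is not a multiple of 4",
            bytes.len()
        ));
    }
    Ok(bytes
        .chunks_exact(4)
        .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect())
}
```

Fixing the byte order explicitly with `to_le_bytes` and `from_le_bytes` keeps shards portable between hosts of different endianness.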

Fields

§corpus: PathBuf

Path to JSONL corpus file or directory of .jsonl files.

§tokenizer: PathBuf

Directory containing vocab.json + merges.txt from apr tokenize train.

§output: PathBuf

Output directory for shard-NNNN.bin + manifest.json.

§shard_tokens: usize

Target tokens per shard (shard closes once this limit is reached).

§content_field: String

JSONL field to encode (default: content).

§normalization: String

Unicode normalization (must match tokenizer training).

§eos_policy: String

EOS insertion policy: none|between|after.
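The `eos_policy` and `shard_tokens` options interact when documents are concatenated and cut into shards. The sketch below shows one plausible reading and is not the command's implementation: `EosPolicy`, `join_documents`, and `split_shards` are hypothetical names. "between" is taken to mean one EOS token between consecutive documents, and "after" one EOS after every document. Shards are cut at exactly `shard_tokens` tokens and numbered from 0000, both of which are assumptions.

```rust
use std::str::FromStr;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum EosPolicy {
    None,
    Between,
    After,
}

impl FromStr for EosPolicy {
    type Err = String;

    /// Parse the three policy names listed for `eos_policy`.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "none" => Ok(EosPolicy::None),
            "between" => Ok(EosPolicy::Between),
            "after" => Ok(EosPolicy::After),
            other => Err(format!("unknown eos policy: {}", other)),
        }
    }
}

/// Concatenate encoded documents into one stream, inserting `eos` per policy.
pub fn join_documents(docs: &[Vec<u32>], eos: u32, policy: EosPolicy) -> Vec<u32> {
    let mut stream = Vec::new();
    for (i, doc) in docs.iter().enumerate() {
        if policy == EosPolicy::Between && i > 0 {
            stream.push(eos);
        }
        stream.extend_from_slice(doc);
        if policy == EosPolicy::After {
            stream.push(eos);
        }
    }
    stream
}

/// Split a token stream into named shards of at most `shard_tokens` tokens.
pub fn split_shards(stream: &[u32], shard_tokens: usize) -> Vec<(String, Vec<u32>)> {
    assert!(shard_tokens > 0, "shard_tokens must be positive");
    stream
        .chunks(shard_tokens)
        .enumerate()
        .map(|(i, chunk)| (format!("shard-{:04}.bin", i), chunk.to_vec()))
        .collect()
}
```

A real encoder may prefer to close a shard at the next document boundary after the target is reached; this sketch cuts at the exact count to keep the logic short.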

Trait Implementations§

impl Debug for TokenizeCommands

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

impl FromArgMatches for TokenizeCommands

fn from_arg_matches(__clap_arg_matches: &ArgMatches) -> Result<Self, Error>

Instantiate Self from ArgMatches, parsing the arguments as needed.

fn from_arg_matches_mut(__clap_arg_matches: &mut ArgMatches) -> Result<Self, Error>

Instantiate Self from ArgMatches, parsing the arguments as needed.

fn update_from_arg_matches(&mut self, __clap_arg_matches: &ArgMatches) -> Result<(), Error>

Assign values from ArgMatches to self.

fn update_from_arg_matches_mut<'b>(&mut self, __clap_arg_matches: &mut ArgMatches) -> Result<(), Error>

Assign values from ArgMatches to self.

impl Subcommand for TokenizeCommands

fn augment_subcommands<'b>(__clap_app: Command) -> Command

Append to Command so it can instantiate Self via FromArgMatches::from_arg_matches_mut.

fn augment_subcommands_for_update<'b>(__clap_app: Command) -> Command

Append to Command so it can instantiate self via FromArgMatches::update_from_arg_matches_mut.

fn has_subcommand(__clap_name: &str) -> bool

Test whether Self can parse a specific subcommand.

Auto Trait Implementations§

Blanket Implementations§

Source§

impl<T> Any for T
where T: 'static + ?Sized,

Source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
Source§

impl<T> Borrow<T> for T
where T: ?Sized,

Source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
Source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

Source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
Source§

impl<T> Conv for T

Source§

fn conv<T>(self) -> T
where Self: Into<T>,

Converts self into T using Into<T>. Read more
Source§

impl<T> Downcast<T> for T

Source§

fn downcast(&self) -> &T

Source§

impl<T> FmtForward for T

Source§

fn fmt_binary(self) -> FmtBinary<Self>
where Self: Binary,

Causes self to use its Binary implementation when Debug-formatted.
Source§

fn fmt_display(self) -> FmtDisplay<Self>
where Self: Display,

Causes self to use its Display implementation when Debug-formatted.
Source§

fn fmt_lower_exp(self) -> FmtLowerExp<Self>
where Self: LowerExp,

Causes self to use its LowerExp implementation when Debug-formatted.
Source§

fn fmt_lower_hex(self) -> FmtLowerHex<Self>
where Self: LowerHex,

Causes self to use its LowerHex implementation when Debug-formatted.
Source§

fn fmt_octal(self) -> FmtOctal<Self>
where Self: Octal,

Causes self to use its Octal implementation when Debug-formatted.
Source§

fn fmt_pointer(self) -> FmtPointer<Self>
where Self: Pointer,

Causes self to use its Pointer implementation when Debug-formatted.
Source§

fn fmt_upper_exp(self) -> FmtUpperExp<Self>
where Self: UpperExp,

Causes self to use its UpperExp implementation when Debug-formatted.
Source§

fn fmt_upper_hex(self) -> FmtUpperHex<Self>
where Self: UpperHex,

Causes self to use its UpperHex implementation when Debug-formatted.
Source§

fn fmt_list(self) -> FmtList<Self>
where for<'a> &'a Self: IntoIterator,

Formats each item in a sequence. Read more
Source§

impl<T> From<T> for T

Source§

fn from(t: T) -> T

Returns the argument unchanged.

Source§

impl<T> FutureExt for T

Source§

fn with_context(self, otel_cx: Context) -> WithContext<Self>

Attaches the provided Context to this type, returning a WithContext wrapper. Read more
Source§

fn with_current_context(self) -> WithContext<Self>

Attaches the current Context to this type, returning a WithContext wrapper. Read more
Source§

impl<T> Instrument for T

Source§

fn instrument(self, span: Span) -> Instrumented<Self>

Instruments this type with the provided Span, returning an Instrumented wrapper. Read more
Source§

fn in_current_span(self) -> Instrumented<Self>

Instruments this type with the current Span, returning an Instrumented wrapper. Read more
Source§

impl<T, U> Into<U> for T
where U: From<T>,

Source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

Source§

impl<T> IntoEither for T

Source§

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more
Source§

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more
Source§

impl<T> IntoRequest<T> for T

Source§

fn into_request(self) -> Request<T>

Wrap the input message T in a tonic::Request
Source§

impl<L> LayerExt<L> for L

Source§

fn named_layer<S>(&self, service: S) -> Layered<<L as Layer<S>>::Service, S>
where L: Layer<S>,

Applies the layer to a service and wraps it in Layered.
Source§

impl<T> Pipe for T
where T: ?Sized,

Source§

fn pipe<R>(self, func: impl FnOnce(Self) -> R) -> R
where Self: Sized,

Pipes by value. This is generally the method you want to use. Read more
Source§

fn pipe_ref<'a, R>(&'a self, func: impl FnOnce(&'a Self) -> R) -> R
where R: 'a,

Borrows self and passes that borrow into the pipe function. Read more
Source§

fn pipe_ref_mut<'a, R>(&'a mut self, func: impl FnOnce(&'a mut Self) -> R) -> R
where R: 'a,

Mutably borrows self and passes that borrow into the pipe function. Read more
Source§

fn pipe_borrow<'a, B, R>(&'a self, func: impl FnOnce(&'a B) -> R) -> R
where Self: Borrow<B>, B: 'a + ?Sized, R: 'a,

Borrows self, then passes self.borrow() into the pipe function. Read more
Source§

fn pipe_borrow_mut<'a, B, R>( &'a mut self, func: impl FnOnce(&'a mut B) -> R, ) -> R
where Self: BorrowMut<B>, B: 'a + ?Sized, R: 'a,

Mutably borrows self, then passes self.borrow_mut() into the pipe function. Read more
Source§

fn pipe_as_ref<'a, U, R>(&'a self, func: impl FnOnce(&'a U) -> R) -> R
where Self: AsRef<U>, U: 'a + ?Sized, R: 'a,

Borrows self, then passes self.as_ref() into the pipe function.
Source§

fn pipe_as_mut<'a, U, R>(&'a mut self, func: impl FnOnce(&'a mut U) -> R) -> R
where Self: AsMut<U>, U: 'a + ?Sized, R: 'a,

Mutably borrows self, then passes self.as_mut() into the pipe function.
Source§

fn pipe_deref<'a, T, R>(&'a self, func: impl FnOnce(&'a T) -> R) -> R
where Self: Deref<Target = T>, T: 'a + ?Sized, R: 'a,

Borrows self, then passes self.deref() into the pipe function.
Source§

fn pipe_deref_mut<'a, T, R>( &'a mut self, func: impl FnOnce(&'a mut T) -> R, ) -> R
where Self: DerefMut<Target = T> + Deref, T: 'a + ?Sized, R: 'a,

Mutably borrows self, then passes self.deref_mut() into the pipe function.
Source§

impl<T> Pointable for T

Source§

const ALIGN: usize

The alignment of pointer.
Source§

type Init = T

The type for initializers.
Source§

unsafe fn init(init: <T as Pointable>::Init) -> usize

Initializes a with the given initializer. Read more
Source§

unsafe fn deref<'a>(ptr: usize) -> &'a T

Dereferences the given pointer. Read more
Source§

unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut T

Mutably dereferences the given pointer. Read more
Source§

unsafe fn drop(ptr: usize)

Drops the object pointed to by the given pointer. Read more
Source§

impl<T> PolicyExt for T
where T: ?Sized,

Source§

fn and<P, B, E>(self, other: P) -> And<T, P>
where T: Policy<B, E>, P: Policy<B, E>,

Create a new Policy that returns Action::Follow only if self and other return Action::Follow. Read more
Source§

fn or<P, B, E>(self, other: P) -> Or<T, P>
where T: Policy<B, E>, P: Policy<B, E>,

Create a new Policy that returns Action::Follow if either self or other returns Action::Follow. Read more
Source§

impl<T> Same for T

Source§

type Output = T

Should always be Self
Source§

impl<T> Tap for T

Source§

fn tap(self, func: impl FnOnce(&Self)) -> Self

Immutable access to a value. Read more
Source§

fn tap_mut(self, func: impl FnOnce(&mut Self)) -> Self

Mutable access to a value. Read more
Source§

fn tap_borrow<B>(self, func: impl FnOnce(&B)) -> Self
where Self: Borrow<B>, B: ?Sized,

Immutable access to the Borrow<B> of a value. Read more
Source§

fn tap_borrow_mut<B>(self, func: impl FnOnce(&mut B)) -> Self
where Self: BorrowMut<B>, B: ?Sized,

Mutable access to the BorrowMut<B> of a value. Read more
Source§

fn tap_ref<R>(self, func: impl FnOnce(&R)) -> Self
where Self: AsRef<R>, R: ?Sized,

Immutable access to the AsRef<R> view of a value. Read more
Source§

fn tap_ref_mut<R>(self, func: impl FnOnce(&mut R)) -> Self
where Self: AsMut<R>, R: ?Sized,

Mutable access to the AsMut<R> view of a value. Read more
Source§

fn tap_deref<T>(self, func: impl FnOnce(&T)) -> Self
where Self: Deref<Target = T>, T: ?Sized,

Immutable access to the Deref::Target of a value. Read more
Source§

fn tap_deref_mut<T>(self, func: impl FnOnce(&mut T)) -> Self
where Self: DerefMut<Target = T> + Deref, T: ?Sized,

Mutable access to the Deref::Target of a value. Read more
Source§

fn tap_dbg(self, func: impl FnOnce(&Self)) -> Self

Calls .tap() only in debug builds, and is erased in release builds.
Source§

fn tap_mut_dbg(self, func: impl FnOnce(&mut Self)) -> Self

Calls .tap_mut() only in debug builds, and is erased in release builds.
Source§

fn tap_borrow_dbg<B>(self, func: impl FnOnce(&B)) -> Self
where Self: Borrow<B>, B: ?Sized,

Calls .tap_borrow() only in debug builds, and is erased in release builds.
Source§

fn tap_borrow_mut_dbg<B>(self, func: impl FnOnce(&mut B)) -> Self
where Self: BorrowMut<B>, B: ?Sized,

Calls .tap_borrow_mut() only in debug builds, and is erased in release builds.
Source§

fn tap_ref_dbg<R>(self, func: impl FnOnce(&R)) -> Self
where Self: AsRef<R>, R: ?Sized,

Calls .tap_ref() only in debug builds, and is erased in release builds.
Source§

fn tap_ref_mut_dbg<R>(self, func: impl FnOnce(&mut R)) -> Self
where Self: AsMut<R>, R: ?Sized,

Calls .tap_ref_mut() only in debug builds, and is erased in release builds.
Source§

fn tap_deref_dbg<T>(self, func: impl FnOnce(&T)) -> Self
where Self: Deref<Target = T>, T: ?Sized,

Calls .tap_deref() only in debug builds, and is erased in release builds.
Source§

fn tap_deref_mut_dbg<T>(self, func: impl FnOnce(&mut T)) -> Self
where Self: DerefMut<Target = T> + Deref, T: ?Sized,

Calls .tap_deref_mut() only in debug builds, and is erased in release builds.
Source§

impl<T> TryConv for T

Source§

fn try_conv<T>(self) -> Result<T, Self::Error>
where Self: TryInto<T>,

Attempts to convert self into T using TryInto<T>. Read more
Source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

Source§

type Error = Infallible

The type returned in the event of a conversion error.
Source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
Source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

Source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
Source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.
Source§

impl<T> Upcast<T> for T

Source§

fn upcast(&self) -> Option<&T>

Source§

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

Source§

fn vzip(self) -> V

Source§

impl<T> WithSubscriber for T

Source§

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self>
where S: Into<Dispatch>,

Attaches the provided Subscriber to this type, returning a WithDispatch wrapper. Read more
Source§

fn with_current_subscriber(self) -> WithDispatch<Self>

Attaches the current default Subscriber to this type, returning a WithDispatch wrapper. Read more
Source§

impl<T> Allocation for T
where T: RefUnwindSafe + Send + Sync,

Source§

impl<A, B, T> HttpServerConnExec<A, B> for T
where B: Body,

Source§

impl<T> WasmNotSend for T
where T: Send,

Source§

impl<T> WasmNotSendSync for T

Source§

impl<T> WasmNotSync for T
where T: Sync,