Enum TokenizeCommands 

pub enum TokenizeCommands {
    Plan {
        data: PathBuf,
        vocab_size: usize,
        algorithm: String,
        output: PathBuf,
        format: String,
    },
    Apply {
        data: PathBuf,
        vocab_size: usize,
        algorithm: String,
        output: PathBuf,
        max_lines: usize,
    },
    Train {
        corpus: PathBuf,
        vocab_size: usize,
        min_frequency: usize,
        output: PathBuf,
        normalization: String,
    },
    ImportHf {
        input: PathBuf,
        output: PathBuf,
        include_added_tokens: bool,
    },
    EncodeCorpus {
        corpus: Vec<PathBuf>,
        tokenizer: PathBuf,
        output: PathBuf,
        shard_tokens: usize,
        content_field: String,
        normalization: String,
        eos_policy: String,
        num_workers: Option<usize>,
        quiet: bool,
        progress_interval_docs: u64,
        progress_interval_seconds: u64,
        estimate_only: bool,
        estimate_sample_docs: u64,
    },
    RepairManifest {
        output: PathBuf,
        tokenizer: Option<PathBuf>,
        json: bool,
    },
}
Tokenizer training pipeline subcommands (forjar-style plan/apply).

Thin CLI wrappers around aprender’s BPE training infrastructure. Trains a BPE vocabulary from a text corpus for use in model training.

Variants§

§

Plan

Validate inputs and estimate tokenizer training time/resources.

Checks that the input corpus exists, counts lines/bytes, estimates vocabulary coverage, and reports expected training time. Outputs a serializable plan manifest (text, JSON, or YAML).

Analogous to forjar plan — shows what will happen before committing.

Fields

§data: PathBuf

Path to training corpus (text file, one document per line)

§vocab_size: usize

Target vocabulary size

§algorithm: String

Tokenizer algorithm: bpe, wordpiece, unigram

§output: PathBuf

Output directory for trained tokenizer

§format: String

Output format: text, json, yaml

§

Apply

Train a tokenizer on the corpus.

Reads the input corpus, trains a BPE/WordPiece/Unigram tokenizer, and writes vocab.json + merges.txt to the output directory.

Analogous to forjar apply — commits resources and executes the plan.

Fields

§data: PathBuf

Path to training corpus (text file, one document per line)

§vocab_size: usize

Target vocabulary size

§algorithm: String

Tokenizer algorithm: bpe, wordpiece, unigram

§output: PathBuf

Output directory for trained tokenizer

§max_lines: usize

Maximum number of lines to read from corpus (0 = all)
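
The "0 = all" convention for max_lines can be sketched as follows. This is a minimal illustration using only the standard library; the helper name read_corpus_lines is hypothetical and is not taken from the crate.

```rust
use std::io::{self, BufRead};

/// Hypothetical helper: read at most `max_lines` lines, where 0 means "no limit".
fn read_corpus_lines<R: BufRead>(reader: R, max_lines: usize) -> io::Result<Vec<String>> {
    // Map the sentinel 0 onto an effectively unbounded limit.
    let limit = if max_lines == 0 { usize::MAX } else { max_lines };
    // Collecting into io::Result stops at the first read error.
    reader.lines().take(limit).collect()
}

fn main() -> io::Result<()> {
    let corpus = "first doc\nsecond doc\nthird doc\n";
    let capped = read_corpus_lines(corpus.as_bytes(), 2)?;
    let all = read_corpus_lines(corpus.as_bytes(), 0)?;
    println!("capped: {} lines, all: {} lines", capped.len(), all.len());
    Ok(())
}
```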

§

Train

Train BPE on a JSONL corpus per contracts/tokenizer-bpe-v1.yaml (MODEL-2).

Walks --corpus (file or directory of .jsonl files), extracting the content field from each line, applies --normalization (NFC default), and trains a BPE tokenizer with the target vocab size. Writes vocab.json (token→id) and merges.txt (one a b pair per line, in merge order) to --output.
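
The merges.txt layout described above (one "a b" pair per line, in merge order) could be written and read back roughly like this. The function names and the trailing newline after the last pair are assumptions made for illustration, not the crate's actual writer.

```rust
/// Render merges in merge order, one space-separated pair per line.
/// Assumption: every line, including the last, ends with '\n'.
fn render_merges(merges: &[(String, String)]) -> String {
    merges.iter().map(|(a, b)| format!("{a} {b}\n")).collect()
}

/// Parse the same layout back; lines without a space are skipped.
fn parse_merges(text: &str) -> Vec<(String, String)> {
    text.lines()
        .filter_map(|line| line.split_once(' '))
        .map(|(a, b)| (a.to_string(), b.to_string()))
        .collect()
}

fn main() {
    let merges = vec![
        ("l".to_string(), "o".to_string()),
        ("lo".to_string(), "w".to_string()),
    ];
    let text = render_merges(&merges);
    print!("{text}");
    assert_eq!(parse_merges(&text), merges);
}
```

Line order carries meaning here: earlier lines are earlier merges, so a reader must preserve it.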

Fields

§corpus: PathBuf

Path to corpus: a .jsonl file or a directory containing .jsonl files. Each line must be a JSON object with a content field.

§vocab_size: usize

Target vocabulary size. Default 50_257 matches GPT-2 convention (50_000 BPE merges + 256 byte-level fallback tokens + 1 sentinel) and the MODEL-2 albor tokenizer contract (tokenizer-bpe-v1 v1.2.0).
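
The arithmetic behind the default can be spelled out as constants. The constant and function names below are illustrative only; the three component sizes come from the description above.

```rust
// Components of the default vocabulary size, as listed in the field docs.
const BPE_MERGES: usize = 50_000;
const BYTE_FALLBACK_TOKENS: usize = 256;
const SENTINEL_TOKENS: usize = 1;

/// Sum of the three components (illustrative helper, not part of the crate).
const fn default_vocab_size() -> usize {
    BPE_MERGES + BYTE_FALLBACK_TOKENS + SENTINEL_TOKENS
}

fn main() {
    println!("default vocab_size = {}", default_vocab_size());
}
```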

§min_frequency: usize

Minimum frequency a byte-pair must reach before BPE merges it into a new vocabulary token (honored by entrenar::tokenizer::BPETokenizer per task #103). Pairs below this threshold are left unmerged — contract INV-TOK-002 of contracts/tokenizer-bpe-v1.yaml.
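
To make the threshold concrete, here is a minimal sketch of one BPE merge-selection step with a min_frequency cut-off. It is not the entrenar implementation: symbols are plain strings rather than bytes, and the lexicographic tie-break is an assumption added so the result is deterministic.

```rust
use std::collections::HashMap;

/// Count adjacent symbol pairs across all words.
fn count_pairs(words: &[Vec<String>]) -> HashMap<(String, String), usize> {
    let mut counts = HashMap::new();
    for word in words {
        for pair in word.windows(2) {
            *counts.entry((pair[0].clone(), pair[1].clone())).or_insert(0) += 1;
        }
    }
    counts
}

/// Pick the most frequent pair whose count is at least `min_frequency`.
/// Ties go to the lexicographically smallest pair (an assumption).
fn best_pair(
    counts: &HashMap<(String, String), usize>,
    min_frequency: usize,
) -> Option<(String, String)> {
    counts
        .iter()
        .filter(|entry| *entry.1 >= min_frequency)
        .max_by(|a, b| a.1.cmp(b.1).then_with(|| b.0.cmp(a.0)))
        .map(|(pair, _)| pair.clone())
}

fn main() {
    let words: Vec<Vec<String>> = ["low", "low", "lower"]
        .iter()
        .map(|w| w.chars().map(|c| c.to_string()).collect())
        .collect();
    let counts = count_pairs(&words);
    println!("min_frequency=2 -> {:?}", best_pair(&counts, 2));
    println!("min_frequency=4 -> {:?}", best_pair(&counts, 4));
}
```

When no pair reaches the threshold, best_pair returns None and training would stop merging, even if the target vocabulary size has not been reached.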

§output: PathBuf

Output directory; will contain vocab.json and merges.txt.

§normalization: String

Unicode normalization form applied to each document before training.

§

ImportHf

Import a HuggingFace tokenizer.json into aprender’s two-file vocab.json + merges.txt layout per contracts/apr-cli-tokenize-import-hf-v1.yaml (§50.4 step 5g.0).

Reads <INPUT> (a HF tokenizer.json with model.type == "BPE"), extracts model.vocab → <OUTPUT>/vocab.json and model.merges → <OUTPUT>/merges.txt (one space-separated merge per line), and writes <OUTPUT>/manifest.json with extraction provenance (source path, sha256, vocab_size, merges_count, timestamp).

Non-BPE inputs (Unigram, WordPiece) are rejected fail-fast with a clear error citing the contract id.

Unblocks fine-tuning from public HF checkpoints (Qwen2.5/Llama2/Mistral), which are distributed as a single tokenizer.json. The output directory is consumable by apr tokenize encode-corpus --tokenizer <DIR> and apr pretrain --tokenizer <DIR> without modification.

Fields

§input: PathBuf

Path to input HuggingFace tokenizer.json (BPE model required).

§output: PathBuf

Output directory; will contain vocab.json + merges.txt + manifest.json.

§include_added_tokens: bool

Include added_tokens in vocab.json (default: BPE state machine only). Use this when the downstream consumer needs special tokens (e.g., <|im_start|>, <|endoftext|>) materialized in vocab.json itself.

§

EncodeCorpus

Encode a JSONL corpus into .bin shards per contracts/pretokenize-bin-v1.yaml.

Loads a trained BPE tokenizer (vocab.json + merges.txt) from --tokenizer, reads --corpus (file or directory of .jsonl files), encodes the --content-field of each line to u32 tokens, and writes shard-NNNN.bin files (flat little-endian u32 streams) into --output. The output format is precisely what ShardBatchIter (aprender-train) expects at MODEL-2 pretrain read time.
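
The shard payload described above, a flat little-endian u32 stream, can be illustrated with a standard-library round trip. The function names are hypothetical; only the byte layout is taken from the text.

```rust
/// Serialize tokens as a flat little-endian u32 stream (4 bytes per token).
fn tokens_to_bytes(tokens: &[u32]) -> Vec<u8> {
    tokens.iter().flat_map(|t| t.to_le_bytes()).collect()
}

/// Decode such a stream; returns None if the length is not a multiple of 4.
fn bytes_to_tokens(bytes: &[u8]) -> Option<Vec<u32>> {
    if bytes.len() % 4 != 0 {
        return None;
    }
    Some(
        bytes
            .chunks_exact(4)
            .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect(),
    )
}

fn main() {
    let tokens = [1u32, 258, 50_256];
    let bytes = tokens_to_bytes(&tokens);
    println!("{} tokens -> {} bytes", tokens.len(), bytes.len());
    assert_eq!(bytes_to_tokens(&bytes).as_deref(), Some(&tokens[..]));
}
```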

Root-cause fix for the pretokenize-to-bin gap documented in memory/project_shard_reader_bin_format.md — replaces a Python shim that was flagged as MUDA on 2026-04-19.

Fields

§corpus: Vec<PathBuf>

Path to JSONL corpus file, parquet shard, or directory of .jsonl or .parquet files. Pass --corpus multiple times to merge multiple sources into a single output corpus (SPEC §83 P2-C — see contracts/corpus-merge-v3-v1.yaml). When repeated, sources are encoded in command-line order and shard numbering is continuous across sources.

§tokenizer: PathBuf

Directory containing vocab.json + merges.txt from apr tokenize train.

§output: PathBuf

Output directory for shard-NNNN.bin + manifest.json.

§shard_tokens: usize

Target tokens per shard (shard closes once this limit is reached).
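
One way to read "shard closes once this limit is reached" is sketched below. It assumes documents are never split across shards, so a shard may exceed the target by part of its last document; the real encoder may behave differently. The four-digit zero padding is inferred from the shard-NNNN.bin pattern.

```rust
/// File name for the shard at `index`, following the shard-NNNN.bin pattern.
fn shard_name(index: usize) -> String {
    format!("shard-{index:04}.bin")
}

/// Group per-document token vectors into shards, closing a shard as soon as
/// it holds at least `shard_tokens` tokens (assumed whole-document policy).
fn split_into_shards(docs: &[Vec<u32>], shard_tokens: usize) -> Vec<Vec<u32>> {
    let mut shards = Vec::new();
    let mut current: Vec<u32> = Vec::new();
    for doc in docs {
        current.extend_from_slice(doc);
        if current.len() >= shard_tokens {
            shards.push(std::mem::take(&mut current));
        }
    }
    // Flush the final, possibly short, shard.
    if !current.is_empty() {
        shards.push(current);
    }
    shards
}

fn main() {
    let docs = vec![vec![1, 2, 3], vec![4, 5], vec![6]];
    for (i, shard) in split_into_shards(&docs, 4).iter().enumerate() {
        println!("{}: {} tokens", shard_name(i), shard.len());
    }
}
```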

§content_field: String

JSONL field to encode (default: content).

§normalization: String

Unicode normalization (must match tokenizer training).

§eos_policy: String

EOS insertion policy: none|between|after.
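
The three policy names suggest the following behaviour, sketched here under stated assumptions: "between" places one EOS token between consecutive documents, and "after" appends one after every document, including the last. The enum, the parser, and the EOS id used in main are illustrative and not taken from the crate.

```rust
use std::str::FromStr;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EosPolicy {
    None,
    Between,
    After,
}

impl FromStr for EosPolicy {
    type Err = String;

    /// Parse the CLI string form (none|between|after).
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "none" => Ok(Self::None),
            "between" => Ok(Self::Between),
            "after" => Ok(Self::After),
            other => Err(format!("unknown eos policy: {other}")),
        }
    }
}

/// Concatenate encoded documents, inserting `eos` according to `policy`.
fn concat_with_eos(docs: &[Vec<u32>], eos: u32, policy: EosPolicy) -> Vec<u32> {
    let mut out = Vec::new();
    for (i, doc) in docs.iter().enumerate() {
        if policy == EosPolicy::Between && i > 0 {
            out.push(eos);
        }
        out.extend_from_slice(doc);
        if policy == EosPolicy::After {
            out.push(eos);
        }
    }
    out
}

fn main() {
    let docs = vec![vec![1, 2], vec![3]];
    for name in ["none", "between", "after"] {
        let policy: EosPolicy = name.parse().expect("valid policy");
        println!("{name}: {:?}", concat_with_eos(&docs, 0, policy));
    }
}
```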

§num_workers: Option<usize>

Number of rayon workers for per-document BPE encoding.

Defaults to std::thread::available_parallelism() (logical CPU count). Set to 1 to force the single-threaded byte-identical legacy path. Set to a fixed N to bound memory or share the host with other jobs.

Output shard order is preserved: chunked encoding keeps original document order regardless of worker count (issue #1547, contracts/apr-tokenize-parallel-bpe-v1.yaml parallel_correctness).
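
The order-preservation property can be demonstrated with a small sketch. The documentation names rayon; this example substitutes std::thread::scope so that it stays dependency-free, and encode_doc is a stand-in for a real BPE encoder. The point is only that chunked encoding followed by in-order joining yields the same output for any worker count.

```rust
use std::thread;

/// Stand-in encoder: one token per input byte. A real BPE encoder goes here.
fn encode_doc(doc: &str) -> Vec<u32> {
    doc.bytes().map(u32::from).collect()
}

/// Encode all documents with up to `num_workers` threads, keeping input order.
fn encode_all(docs: &[String], num_workers: usize) -> Vec<Vec<u32>> {
    let workers = num_workers.max(1);
    if workers == 1 || docs.is_empty() {
        // Single-threaded path.
        return docs.iter().map(|d| encode_doc(d)).collect();
    }
    // Contiguous chunks: chunk i holds documents that precede those of chunk i + 1.
    let chunk_len = (docs.len() + workers - 1) / workers;
    thread::scope(|scope| {
        let handles: Vec<_> = docs
            .chunks(chunk_len)
            .map(|chunk| {
                scope.spawn(move || chunk.iter().map(|d| encode_doc(d)).collect::<Vec<_>>())
            })
            .collect();
        // Joining in spawn order restores the original document order.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker panicked"))
            .collect()
    })
}

fn main() {
    let docs: Vec<String> = ["a", "bc", "", "def", "g"].iter().map(|s| s.to_string()).collect();
    let serial = encode_all(&docs, 1);
    let parallel = encode_all(&docs, 3);
    println!("identical output: {}", serial == parallel);
}
```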

§quiet: bool

Suppress per-document progress emission to stderr (issue #1547, contract v1.2.0). Default: emit a [progress] doc=N/T tokens=K rate=X.X docs/s eta=... line every --progress-interval-docs docs OR --progress-interval-seconds seconds (whichever fires first). Useful for CI / log-scraping callers that prefer silence.

§progress_interval_docs: u64

Emit a progress line at most every N docs (default 1000). Pair with --progress-interval-seconds — whichever bound is reached first triggers emission. Issue #1547 contract v1.2.0.

§progress_interval_seconds: u64

Emit a progress line at most every S seconds (default 60). Pair with --progress-interval-docs — whichever bound is reached first triggers emission. Issue #1547 contract v1.2.0.
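
The "whichever bound is reached first" rule shared by the two interval flags can be sketched as a small gate. The type and method names are hypothetical; elapsed time is passed in explicitly so the logic is easy to test, whereas a real implementation would likely read a clock.

```rust
use std::time::Duration;

/// Decides when to emit a progress line (illustrative, not the crate's type).
struct ProgressGate {
    interval_docs: u64,
    interval: Duration,
    docs_at_last_emit: u64,
    elapsed_at_last_emit: Duration,
}

impl ProgressGate {
    fn new(interval_docs: u64, interval_seconds: u64) -> Self {
        Self {
            interval_docs,
            interval: Duration::from_secs(interval_seconds),
            docs_at_last_emit: 0,
            elapsed_at_last_emit: Duration::ZERO,
        }
    }

    /// Returns true if either bound has been reached since the last emission.
    fn should_emit(&mut self, docs_done: u64, elapsed: Duration) -> bool {
        let docs_due = docs_done.saturating_sub(self.docs_at_last_emit) >= self.interval_docs;
        let time_due = elapsed.saturating_sub(self.elapsed_at_last_emit) >= self.interval;
        if docs_due || time_due {
            self.docs_at_last_emit = docs_done;
            self.elapsed_at_last_emit = elapsed;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut gate = ProgressGate::new(1000, 60);
    for (docs, secs) in [(999u64, 10u64), (1000, 11), (1500, 50), (1600, 71)] {
        let emit = gate.should_emit(docs, Duration::from_secs(secs));
        println!("docs={docs} t={secs}s emit={emit}");
    }
}
```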

§estimate_only: bool

Pre-flight only: estimate total tokens / shards / wall time without writing any output. Reads --estimate-sample-docs documents (default 1000), encodes them, observes (tokens, wall-time-per-doc), and extrapolates against the total document count. Emits [estimate] lines on stderr; no shards or manifest are written. Operator pre-flight gate before multi-day encode runs (issue #1547 contract v1.3.0).

§estimate_sample_docs: u64

Number of documents to sample for --estimate-only extrapolation (default: 1000). Larger samples → tighter per-doc rate estimate but longer pre-flight wall.
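
The extrapolation behind --estimate-only amounts to scaling per-document averages from the sample up to the full corpus. The sketch below assumes a simple linear model; the struct, the function name, and the rounding choices are illustrative and may not match the actual estimator.

```rust
#[derive(Debug)]
struct Estimate {
    total_tokens: u64,
    shards: u64,
    wall_seconds: f64,
}

/// Linearly extrapolate sample measurements to `total_docs` documents.
/// Returns None when the sample is empty or `shard_tokens` is zero.
fn extrapolate(
    sample_docs: u64,
    sample_tokens: u64,
    sample_seconds: f64,
    total_docs: u64,
    shard_tokens: u64,
) -> Option<Estimate> {
    if sample_docs == 0 || shard_tokens == 0 {
        return None;
    }
    let tokens_per_doc = sample_tokens as f64 / sample_docs as f64;
    let seconds_per_doc = sample_seconds / sample_docs as f64;
    let total_tokens = (tokens_per_doc * total_docs as f64).round() as u64;
    Some(Estimate {
        total_tokens,
        // Ceiling division: a partially filled final shard still counts.
        shards: (total_tokens + shard_tokens - 1) / shard_tokens,
        wall_seconds: seconds_per_doc * total_docs as f64,
    })
}

fn main() {
    if let Some(est) = extrapolate(1_000, 250_000, 2.0, 1_000_000, 100_000_000) {
        println!("[estimate] {est:?}");
    }
}
```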

§

RepairManifest

Reconstruct manifest.json from existing shard-NNNN.bin files.

apr tokenize encode-corpus writes manifest.json only on clean process exit. If the encoder is killed (operator SIGINT, OOM, crash, power loss) AFTER all shards flush but BEFORE manifest write, the corpus on disk is consumable by ShardBatchIter but has no provenance file for ship audit / dashboards.

repair-manifest is the cheap recovery path: it scans <OUTPUT>/shard-*.bin, computes shard_count + total_tokens from file sizes (each shard is a flat little-endian u32 stream; tokens = file_size / 4), and writes a schema-conforming manifest.json. The command is idempotent: running it twice produces byte-identical manifests, modulo the repaired_at timestamp.
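
The size-based reconstruction reduces to simple arithmetic over shard file sizes. The sketch below takes the sizes as input rather than scanning a directory; rejecting sizes that are not a multiple of 4 is an assumption, and the type and function names are hypothetical.

```rust
#[derive(Debug, PartialEq)]
struct ShardSummary {
    shard_count: usize,
    total_tokens: u64,
}

/// Derive shard_count and total_tokens from shard file sizes in bytes.
/// Each token is a u32, so tokens = file_size / 4.
fn summarize(shard_sizes_bytes: &[u64]) -> Result<ShardSummary, String> {
    let mut total_tokens = 0u64;
    for (i, &size) in shard_sizes_bytes.iter().enumerate() {
        // Assumption: a size that is not a multiple of 4 signals a truncated shard.
        if size % 4 != 0 {
            return Err(format!("shard {i} has {size} bytes, not a multiple of 4"));
        }
        total_tokens += size / 4;
    }
    Ok(ShardSummary {
        shard_count: shard_sizes_bytes.len(),
        total_tokens,
    })
}

fn main() {
    match summarize(&[400, 800, 12]) {
        Ok(summary) => println!("{summary:?}"),
        Err(e) => eprintln!("error: {e}"),
    }
}
```

In a real tool the sizes would come from std::fs::metadata(path)?.len() for each shard-*.bin file found in the output directory.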

Contract: contracts/apr-tokenize-repair-manifest-v1.yaml. Motivating instance: SHIP-TWO §56 5g.1 corpus (228 shards flushed, manifest missing).

Fields

§output: PathBuf

Output directory containing shard-NNNN.bin files. manifest.json will be written into this directory.

§tokenizer: Option<PathBuf>

Optional tokenizer directory; when provided, vocab.json is read for the manifest’s vocab_size field. Without it, vocab_size is recorded as null (provenance-incomplete but otherwise valid).

§json: bool

Emit the manifest body as JSON to stdout (in addition to writing to disk).

Trait Implementations§

impl Debug for TokenizeCommands

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

impl FromArgMatches for TokenizeCommands

fn from_arg_matches(__clap_arg_matches: &ArgMatches) -> Result<Self, Error>

Instantiate Self from ArgMatches, parsing the arguments as needed.

fn from_arg_matches_mut(__clap_arg_matches: &mut ArgMatches) -> Result<Self, Error>

Instantiate Self from ArgMatches, parsing the arguments as needed.

fn update_from_arg_matches(&mut self, __clap_arg_matches: &ArgMatches) -> Result<(), Error>

Assign values from ArgMatches to self.

fn update_from_arg_matches_mut<'b>(&mut self, __clap_arg_matches: &mut ArgMatches) -> Result<(), Error>

Assign values from ArgMatches to self.

impl Subcommand for TokenizeCommands

fn augment_subcommands<'b>(__clap_app: Command) -> Command

Append to Command so it can instantiate Self via FromArgMatches::from_arg_matches_mut.

fn augment_subcommands_for_update<'b>(__clap_app: Command) -> Command

Append to Command so it can instantiate self via FromArgMatches::update_from_arg_matches_mut.

fn has_subcommand(__clap_name: &str) -> bool

Test whether Self can parse a specific subcommand.
