Module tokenizer

Minimal WordPiece tokenizer for BERT-style embedding models.

Implements the standard BERT tokenization pipeline:

  1. Lowercase + accent stripping
  2. Whitespace + punctuation splitting
  3. WordPiece subword tokenization
  4. Special token insertion ([CLS], [SEP])
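The core of step 3 is greedy longest-match lookup against the vocabulary, with continuation pieces carrying the standard `##` prefix. The sketch below illustrates that matching loop over a toy vocabulary; the function name, signature, and `[UNK]` fallback are illustrative assumptions, not this module's actual API.

```rust
use std::collections::HashSet;

/// Greedy longest-match WordPiece over a toy vocabulary (illustrative,
/// not the module's real implementation). Continuation pieces use "##".
fn wordpiece(word: &str, vocab: &HashSet<&str>, unk: &str) -> Vec<String> {
    let mut pieces = Vec::new();
    let chars: Vec<char> = word.chars().collect();
    let mut start = 0;
    while start < chars.len() {
        // Try the longest remaining span first, shrinking until a match.
        let mut end = chars.len();
        let mut found = None;
        while end > start {
            let mut piece: String = chars[start..end].iter().collect();
            if start > 0 {
                piece = format!("##{piece}");
            }
            if vocab.contains(piece.as_str()) {
                found = Some(piece);
                break;
            }
            end -= 1;
        }
        match found {
            Some(p) => {
                pieces.push(p);
                start = end;
            }
            // No piece matches: the whole word becomes [UNK].
            None => return vec![unk.to_string()],
        }
    }
    pieces
}

fn main() {
    let vocab: HashSet<&str> =
        ["token", "##izer", "un", "##related"].into_iter().collect();
    assert_eq!(wordpiece("tokenizer", &vocab, "[UNK]"), vec!["token", "##izer"]);
    assert_eq!(wordpiece("xyz", &vocab, "[UNK]"), vec!["[UNK]"]);
    println!("ok");
}
```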

Optimized for code search: handles camelCase, snake_case, and common programming punctuation correctly.
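Handling identifiers well typically means splitting them before WordPiece runs: snake_case on underscores, camelCase at lower-to-upper boundaries. A hypothetical pre-splitter along those lines (the function name and exact boundary rules are assumptions, not this module's code):

```rust
/// Hypothetical identifier pre-splitter: breaks snake_case on '_' and
/// camelCase at lowercase->uppercase boundaries, lowercasing the parts.
fn split_identifier(ident: &str) -> Vec<String> {
    let mut parts = Vec::new();
    let mut cur = String::new();
    let mut prev_lower = false;
    for c in ident.chars() {
        if c == '_' {
            // Underscore terminates the current part and is dropped.
            if !cur.is_empty() {
                parts.push(cur.clone());
                cur.clear();
            }
            prev_lower = false;
        } else {
            if c.is_uppercase() && prev_lower {
                // camelCase boundary: "parseHttp" -> "parse" | "http"
                parts.push(cur.clone());
                cur.clear();
            }
            prev_lower = c.is_lowercase();
            cur.push(c.to_ascii_lowercase());
        }
    }
    if !cur.is_empty() {
        parts.push(cur);
    }
    parts
}

fn main() {
    assert_eq!(split_identifier("parseHttpRequest"), vec!["parse", "http", "request"]);
    assert_eq!(split_identifier("snake_case_name"), vec!["snake", "case", "name"]);
    println!("ok");
}
```

Splitting this way keeps subwords aligned with how developers name things, so queries like "http request" can match `parseHttpRequest`.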

Structs

TokenizedInput
WordPieceTokenizer