Module tokenizer

Minimal WordPiece tokenizer for BERT-style embedding models.

Implements the standard BERT tokenization pipeline:

  1. Lowercase + accent stripping
  2. Whitespace + punctuation splitting
  3. WordPiece subword tokenization
  4. Special token insertion ([CLS], [SEP])
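The core of step 3 is greedy longest-match lookup against the vocabulary, with continuation pieces carrying the standard `##` prefix. The sketch below illustrates that matching loop over a toy vocabulary; the function name, signature, and `[UNK]` fallback are illustrative assumptions, not this module's actual API.

```rust
use std::collections::HashSet;

/// Greedy longest-match WordPiece over a toy vocabulary (illustrative,
/// not the module's real implementation). Continuation pieces use "##".
fn wordpiece(word: &str, vocab: &HashSet<&str>, unk: &str) -> Vec<String> {
    let mut pieces = Vec::new();
    let chars: Vec<char> = word.chars().collect();
    let mut start = 0;
    while start < chars.len() {
        // Try the longest remaining span first, shrinking until a match.
        let mut end = chars.len();
        let mut found = None;
        while end > start {
            let mut piece: String = chars[start..end].iter().collect();
            if start > 0 {
                piece = format!("##{piece}");
            }
            if vocab.contains(piece.as_str()) {
                found = Some(piece);
                break;
            }
            end -= 1;
        }
        match found {
            Some(p) => {
                pieces.push(p);
                start = end;
            }
            // No piece matches: the whole word becomes [UNK].
            None => return vec![unk.to_string()],
        }
    }
    pieces
}

fn main() {
    let vocab: HashSet<&str> =
        ["token", "##izer", "un", "##related"].into_iter().collect();
    assert_eq!(wordpiece("tokenizer", &vocab, "[UNK]"), vec!["token", "##izer"]);
    assert_eq!(wordpiece("xyz", &vocab, "[UNK]"), vec!["[UNK]"]);
    println!("ok");
}
```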

Optimized for code search: handles camelCase, snake_case, and common programming punctuation correctly.
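Handling identifiers well typically means splitting them before WordPiece runs: snake_case on underscores, camelCase at lower-to-upper boundaries. A hypothetical pre-splitter along those lines (the function name and exact boundary rules are assumptions, not this module's code):

```rust
/// Hypothetical identifier pre-splitter: breaks snake_case on '_' and
/// camelCase at lowercase->uppercase boundaries, lowercasing the parts.
fn split_identifier(ident: &str) -> Vec<String> {
    let mut parts = Vec::new();
    let mut cur = String::new();
    let mut prev_lower = false;
    for c in ident.chars() {
        if c == '_' {
            // Underscore terminates the current part and is dropped.
            if !cur.is_empty() {
                parts.push(cur.clone());
                cur.clear();
            }
            prev_lower = false;
        } else {
            if c.is_uppercase() && prev_lower {
                // camelCase boundary: "parseHttp" -> "parse" | "http"
                parts.push(cur.clone());
                cur.clear();
            }
            prev_lower = c.is_lowercase();
            cur.push(c.to_ascii_lowercase());
        }
    }
    if !cur.is_empty() {
        parts.push(cur);
    }
    parts
}

fn main() {
    assert_eq!(split_identifier("parseHttpRequest"), vec!["parse", "http", "request"]);
    assert_eq!(split_identifier("snake_case_name"), vec!["snake", "case", "name"]);
    println!("ok");
}
```

Splitting this way keeps subwords aligned with how developers name things, so queries like "http request" can match `parseHttpRequest`.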

Structs

TokenizedInput
WordPieceTokenizer