Minimal WordPiece tokenizer for BERT-style embedding models.
Implements the standard BERT tokenization pipeline:
- Lowercase + accent stripping
- Whitespace + punctuation splitting
- WordPiece subword tokenization
- Special token insertion ([CLS], [SEP])
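The pipeline above can be sketched roughly as follows. This is a minimal illustration, not this module's actual API: the function names `basic_split`, `wordpiece`, and `tokenize`, the in-memory `HashSet` vocabulary, and the `[UNK]` fallback are assumptions made for the example. Accent stripping is omitted because it needs Unicode normalization, which the standard library does not provide.

```rust
use std::collections::HashSet;

/// Lowercase the input, then split on whitespace and ASCII punctuation.
/// Each punctuation character becomes its own word. (Hypothetical helper.)
fn basic_split(text: &str) -> Vec<String> {
    let mut words = Vec::new();
    let mut current = String::new();
    for ch in text.to_lowercase().chars() {
        if ch.is_whitespace() || ch.is_ascii_punctuation() {
            if !current.is_empty() {
                words.push(std::mem::take(&mut current));
            }
            if ch.is_ascii_punctuation() {
                words.push(ch.to_string());
            }
        } else {
            current.push(ch);
        }
    }
    if !current.is_empty() {
        words.push(current);
    }
    words
}

/// Greedy longest-match-first WordPiece split of a single word.
/// Non-initial pieces are looked up with the "##" continuation prefix.
/// If any position cannot be matched, the whole word maps to "[UNK]".
fn wordpiece(word: &str, vocab: &HashSet<&str>) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    let mut pieces = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let mut end = chars.len();
        let mut found = None;
        // Shrink the candidate from the right until it is in the vocabulary.
        while end > start {
            let sub: String = chars[start..end].iter().collect();
            let candidate = if start == 0 { sub } else { format!("##{}", sub) };
            if vocab.contains(candidate.as_str()) {
                found = Some(candidate);
                break;
            }
            end -= 1;
        }
        match found {
            Some(piece) => {
                pieces.push(piece);
                start = end;
            }
            None => return vec!["[UNK]".to_string()],
        }
    }
    pieces
}

/// Full pipeline: basic split, WordPiece per word, then wrap in [CLS] ... [SEP].
fn tokenize(text: &str, vocab: &HashSet<&str>) -> Vec<String> {
    let mut tokens = vec!["[CLS]".to_string()];
    for word in basic_split(text) {
        tokens.extend(wordpiece(&word, vocab));
    }
    tokens.push("[SEP]".to_string());
    tokens
}
```

With a toy vocabulary containing `token`, `##izer`, and `.`, calling `tokenize("Tokenizer.", &vocab)` in this sketch yields `[CLS]`, `token`, `##izer`, `.`, `[SEP]`.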
Optimized for code search: handles camelCase, snake_case, and common programming punctuation correctly.
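One plausible way to handle identifiers is to split them at case and underscore boundaries before the lowercase step discards the case information. The helper below is a hypothetical sketch of that idea; the name `split_identifier` and its exact boundary rules are assumptions, and the module's real behavior may differ.

```rust
/// Split a code identifier into lowercase parts at `_` and at
/// lower-to-upper case boundaries. (Hypothetical helper, not the module's API.)
fn split_identifier(ident: &str) -> Vec<String> {
    let mut parts = Vec::new();
    let mut current = String::new();
    // True when the previous character was a lowercase letter or a digit.
    let mut prev_lower = false;
    for ch in ident.chars() {
        if ch == '_' {
            // Underscores separate parts and are dropped.
            if !current.is_empty() {
                parts.push(std::mem::take(&mut current));
            }
            prev_lower = false;
            continue;
        }
        // An uppercase letter right after a lowercase letter or digit starts a new part.
        if ch.is_uppercase() && prev_lower && !current.is_empty() {
            parts.push(std::mem::take(&mut current));
        }
        prev_lower = ch.is_lowercase() || ch.is_ascii_digit();
        current.extend(ch.to_lowercase());
    }
    if !current.is_empty() {
        parts.push(current);
    }
    parts
}
```

In this sketch, `parseHttpRequest` becomes `parse`, `http`, `request`, and `snake_case_name` becomes `snake`, `case`, `name`. Runs of capitals are not split, so an identifier such as `HTTPServer` stays a single part; a real implementation would likely need an extra rule for acronyms.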