Tokenizer Surgery for Vocabulary Transplantation (GH-447)
When adapting a pre-trained model to a new tokenizer (e.g., domain-specific BPE vocabulary), embedding rows must be transplanted from the source model to the target. Tokens present in both vocabularies get direct copies; missing tokens are handled via nearest-neighbor lookup or average pooling.
§Key Design Decisions
- Overlap threshold: Surgery is rejected if the vocabularies share fewer tokens than the configured threshold (default 50%), preventing catastrophic representation loss
- Three methods: DirectCopy (fastest; zero-fills missing tokens), NearestNeighbor (finds the closest matching source token), AveragePool (mean of all source embeddings for missing tokens)
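The fallback behavior of the three methods might be sketched as follows; the enum and function names here are assumptions for illustration, not the crate's actual API:

```rust
/// Hypothetical sketch of the per-token fallback (names assumed).
#[derive(Clone, Copy)]
enum SurgeryMethod {
    DirectCopy,
    NearestNeighbor,
    AveragePool,
}

/// Produce an embedding row for a target token absent from the source vocab.
fn fill_missing(method: SurgeryMethod, dim: usize, source_rows: &[Vec<f32>]) -> Vec<f32> {
    match method {
        // DirectCopy: zero-fill rows that have no source counterpart.
        SurgeryMethod::DirectCopy => vec![0.0; dim],
        // AveragePool: mean over all source embedding rows.
        SurgeryMethod::AveragePool => {
            let n = source_rows.len() as f32;
            let mut mean = vec![0.0; dim];
            for row in source_rows {
                for (m, v) in mean.iter_mut().zip(row) {
                    *m += v / n;
                }
            }
            mean
        }
        // NearestNeighbor needs a similarity search over source rows;
        // elided to keep the sketch short.
        SurgeryMethod::NearestNeighbor => unimplemented!("similarity search"),
    }
}
```

Tokens present in both vocabularies never reach this fallback; their rows are copied directly.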
§References
- Hewitt et al. 2021: “Initializing New Word Embeddings for Pretrained Language Models”
- Minixhofer et al. 2022: “WECHSEL: Effective Initialization of Subword Embeddings for Cross-Lingual Transfer of Monolingual Models”
§Toyota Way Principles
- Poka-Yoke: Overlap threshold prevents silent vocabulary misalignment
- Jidoka: Validation stops surgery if quality gate fails
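The quality gate described above amounts to a simple threshold check before any rows are moved; a minimal sketch, assuming overlap and threshold are fractions in [0, 1] (the function name is hypothetical):

```rust
/// Hypothetical sketch of the Jidoka quality gate: stop the surgery
/// instead of silently proceeding with a misaligned vocabulary.
fn check_overlap_gate(overlap: f64, threshold: f64) -> Result<(), String> {
    if overlap < threshold {
        Err(format!(
            "vocab overlap {:.1}% below threshold {:.1}%",
            overlap * 100.0,
            threshold * 100.0
        ))
    } else {
        Ok(())
    }
}
```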
Structs§
- SurgeryReport - Summary report produced after an embedding transplant operation.
- TokenizerSurgeryConfig - Configuration for tokenizer surgery.
- VocabMapping - Bidirectional mapping between source and target vocabularies.
Enums§
- SurgeryMethod - Method used to transplant embeddings for tokens not found in both vocabularies.
Functions§
- compute_vocab_overlap - Compute the bidirectional overlap between two token vocabularies.
- transplant_embeddings - Transplant embedding rows from a source model to a target model.
- validate_surgery - Validate that the vocabulary overlap meets the configured quality threshold.
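Assuming vocabularies are token-to-id maps, the bidirectional overlap measure could look like this; the body is illustrative, not the crate's actual implementation:

```rust
use std::collections::HashMap;

/// Illustrative overlap computation: returns
/// (fraction of source tokens found in target,
///  fraction of target tokens found in source).
fn vocab_overlap(source: &HashMap<String, u32>, target: &HashMap<String, u32>) -> (f64, f64) {
    let shared = source.keys().filter(|t| target.contains_key(*t)).count() as f64;
    (shared / source.len() as f64, shared / target.len() as f64)
}
```

Reporting both directions matters: a small domain vocabulary can be almost fully covered by a large source vocabulary while covering very little of it in return.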