
Module tokenizer_surgery


Tokenizer Surgery for Vocabulary Transplantation (GH-447)

When adapting a pre-trained model to a new tokenizer (e.g., domain-specific BPE vocabulary), embedding rows must be transplanted from the source model to the target. Tokens present in both vocabularies get direct copies; missing tokens are handled via nearest-neighbor lookup or average pooling.
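The shared-token step described above can be sketched as follows. This is a hypothetical helper, not the module's actual implementation; the real `VocabMapping` struct may store the mapping differently. Tokens present in both vocabularies yield `(source_row, target_row)` copy pairs, while the rest are collected for fallback handling:

```rust
use std::collections::HashMap;

/// Partition target-vocabulary rows into direct copies and missing rows.
/// (Illustrative sketch; `VocabMapping` in this module may differ.)
fn map_vocabs(
    source: &HashMap<String, usize>,
    target: &HashMap<String, usize>,
) -> (Vec<(usize, usize)>, Vec<usize>) {
    let mut shared = Vec::new();  // (source_row, target_row) direct embedding copies
    let mut missing = Vec::new(); // target rows with no source counterpart
    for (token, &tgt_row) in target {
        match source.get(token) {
            Some(&src_row) => shared.push((src_row, tgt_row)),
            None => missing.push(tgt_row),
        }
    }
    (shared, missing)
}
```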

§Key Design Decisions

  • Overlap threshold: Surgery is rejected if the vocabularies share fewer tokens than the configured threshold (default 50%), preventing catastrophic representation loss
  • Three methods: DirectCopy (fastest; zero-fills rows for missing tokens), NearestNeighbor (copies the closest matching source embedding), AveragePool (initializes missing tokens with the mean of all source embeddings)
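The three fallback strategies can be sketched as below. The variant names mirror `SurgeryMethod`, but the function signature and the `Vec<f32>` row representation are assumptions for illustration; here NearestNeighbor takes a caller-supplied query vector and picks the source row with the highest cosine similarity:

```rust
/// Mirrors the documented `SurgeryMethod` variants (sketch only).
enum SurgeryMethod { DirectCopy, NearestNeighbor, AveragePool }

/// Produce an embedding row for a token absent from the source vocabulary.
/// (Hypothetical signature; the module's real API may differ.)
fn fill_missing_row(
    method: &SurgeryMethod,
    source_rows: &[Vec<f32>],
    query: Option<&[f32]>, // only used by NearestNeighbor
    dim: usize,
) -> Vec<f32> {
    match method {
        // DirectCopy: zero-fill rows for tokens with no source counterpart.
        SurgeryMethod::DirectCopy => vec![0.0; dim],
        // AveragePool: per-dimension mean over all source embedding rows.
        SurgeryMethod::AveragePool => {
            let n = source_rows.len() as f32;
            (0..dim)
                .map(|j| source_rows.iter().map(|r| r[j]).sum::<f32>() / n)
                .collect()
        }
        // NearestNeighbor: copy the source row closest to the query vector.
        SurgeryMethod::NearestNeighbor => {
            let q = query.expect("NearestNeighbor needs a query vector");
            source_rows
                .iter()
                .max_by(|a, b| cosine(a, q).partial_cmp(&cosine(b, q)).unwrap())
                .expect("source embedding matrix must be non-empty")
                .clone()
        }
    }
}

/// Cosine similarity with a small epsilon to avoid division by zero.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb + 1e-12)
}
```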

§References

  • Hewitt 2021: “Initializing New Word Embeddings for Pretrained Language Models”
  • Minixhofer et al. 2022: “WECHSEL: Effective Initialization of Subword Embeddings for Cross-Lingual Transfer of Monolingual Models”

§Toyota Way Principles

  • Poka-Yoke: Overlap threshold prevents silent vocabulary misalignment
  • Jidoka: Validation stops surgery if quality gate fails

Structs§

SurgeryReport
Summary report produced after an embedding transplant operation.
TokenizerSurgeryConfig
Configuration for tokenizer surgery.
VocabMapping
Bidirectional mapping between source and target vocabularies.

Enums§

SurgeryMethod
Method used to transplant embeddings for tokens not found in both vocabularies.

Functions§

compute_vocab_overlap
Compute the bidirectional overlap between two token vocabularies.
transplant_embeddings
Transplant embedding rows from a source model to a target model.
validate_surgery
Validate that the vocabulary overlap meets the configured quality threshold.
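The overlap quality gate behind `compute_vocab_overlap` and `validate_surgery` can be sketched as below. This is a simplified, one-directional version (fraction of target tokens also present in the source) with assumed signatures; the documented functions compute a bidirectional overlap and their exact APIs may differ:

```rust
use std::collections::HashSet;

/// Fraction of target-vocabulary tokens that also exist in the source.
/// (One-directional simplification of the documented bidirectional overlap.)
fn overlap_ratio(source: &HashSet<String>, target: &HashSet<String>) -> f64 {
    if target.is_empty() {
        return 0.0;
    }
    target.intersection(source).count() as f64 / target.len() as f64
}

/// Jidoka-style quality gate: surgery is rejected below the threshold
/// (the module's documented default is 50%).
fn passes_quality_gate(ratio: f64, threshold: f64) -> bool {
    ratio >= threshold
}
```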