Tokenizer Surgery for Vocabulary Transplantation (GH-447)
When adapting a pre-trained model to a new tokenizer (e.g., domain-specific BPE vocabulary), embedding rows must be transplanted from the source model to the target. Tokens present in both vocabularies get direct copies; missing tokens are handled via nearest-neighbor lookup or average pooling.
§Key Design Decisions
- Overlap threshold: Surgery is rejected if the vocabularies share fewer tokens than the configured threshold (default 50%), preventing catastrophic representation loss
- Three methods: DirectCopy (fastest; zero-fills missing tokens), NearestNeighbor (finds the closest matching source token), AveragePool (mean of all source embeddings for missing tokens)
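The fallback behavior of the three methods might be sketched as follows; the enum and function names here are assumptions for illustration, not the crate's actual API:

```rust
/// Hypothetical sketch of the per-token fallback (names assumed).
#[derive(Clone, Copy)]
enum SurgeryMethod {
    DirectCopy,
    NearestNeighbor,
    AveragePool,
}

/// Produce an embedding row for a target token absent from the source vocab.
fn fill_missing(method: SurgeryMethod, dim: usize, source_rows: &[Vec<f32>]) -> Vec<f32> {
    match method {
        // DirectCopy: zero-fill rows that have no source counterpart.
        SurgeryMethod::DirectCopy => vec![0.0; dim],
        // AveragePool: mean over all source embedding rows.
        SurgeryMethod::AveragePool => {
            let n = source_rows.len() as f32;
            let mut mean = vec![0.0; dim];
            for row in source_rows {
                for (m, v) in mean.iter_mut().zip(row) {
                    *m += v / n;
                }
            }
            mean
        }
        // NearestNeighbor needs a similarity search over source rows;
        // elided to keep the sketch short.
        SurgeryMethod::NearestNeighbor => unimplemented!("similarity search"),
    }
}
```

Tokens present in both vocabularies never reach this fallback; their rows are copied directly.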
§References
- Hewitt et al. 2021: “Initializing New Word Embeddings for Pretrained Language Models”
- Minixhofer et al. 2022: “WECHSEL: Effective Initialization of Subword Embeddings for Cross-Lingual Transfer of Monolingual Models”
§Toyota Way Principles
- Poka-Yoke: Overlap threshold prevents silent vocabulary misalignment
- Jidoka: Validation stops surgery if quality gate fails
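The quality gate described above amounts to a simple threshold check before any rows are moved; a minimal sketch, assuming overlap and threshold are fractions in [0, 1] (the function name is hypothetical):

```rust
/// Hypothetical sketch of the Jidoka quality gate: stop the surgery
/// instead of silently proceeding with a misaligned vocabulary.
fn check_overlap_gate(overlap: f64, threshold: f64) -> Result<(), String> {
    if overlap < threshold {
        Err(format!(
            "vocab overlap {:.1}% below threshold {:.1}%",
            overlap * 100.0,
            threshold * 100.0
        ))
    } else {
        Ok(())
    }
}
```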
Structs§
- SurgeryReport - Summary report produced after an embedding transplant operation.
- TokenizerSurgeryConfig - Configuration for tokenizer surgery.
- VocabMapping - Bidirectional mapping between source and target vocabularies.
Enums§
- SurgeryMethod - Method used to transplant embeddings for tokens not found in both vocabularies.
Functions§
- compute_vocab_overlap - Compute the bidirectional overlap between two token vocabularies.
- transplant_embeddings - Transplant embedding rows from a source model to a target model.
- validate_surgery - Validate that the vocabulary overlap meets the configured quality threshold.
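Assuming vocabularies are token-to-id maps, the bidirectional overlap measure could look like this; the body is illustrative, not the crate's actual implementation:

```rust
use std::collections::HashMap;

/// Illustrative overlap computation: returns
/// (fraction of source tokens found in target,
///  fraction of target tokens found in source).
fn vocab_overlap(source: &HashMap<String, u32>, target: &HashMap<String, u32>) -> (f64, f64) {
    let shared = source.keys().filter(|t| target.contains_key(*t)).count() as f64;
    (shared / source.len() as f64, shared / target.len() as f64)
}
```

Reporting both directions matters: a small domain vocabulary can be almost fully covered by a large source vocabulary while covering very little of it in return.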