Knowledge Distillation for Model Compression
Transfer knowledge from large teacher models to small student models by training the student on the teacher's soft targets (softened class probabilities) rather than hard labels.
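Since this page only lists the crate's items, the sketch below illustrates the technique itself rather than this crate's API: `softmax_t` and `distillation_loss` are hypothetical helpers, and the temperature, alpha, and logit values are made up. The T² scaling of the KL term follows Hinton et al. 2015.

```rust
// Hypothetical sketch of soft-target distillation (not this crate's API).
fn softmax_t(logits: &[f64], t: f64) -> Vec<f64> {
    // Subtract the max before exponentiating for numerical stability.
    let max = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = logits.iter().map(|&z| ((z - max) / t).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

fn distillation_loss(
    teacher_logits: &[f64],
    student_logits: &[f64],
    hard_label: usize, // index of the true class
    t: f64,            // distillation temperature
    alpha: f64,        // weight of the distillation term
) -> f64 {
    let p = softmax_t(teacher_logits, t); // teacher's soft targets
    let q = softmax_t(student_logits, t); // softened student predictions
    // D_KL(P || Q) = sum(P * log(P / Q)), scaled by T^2 so gradient
    // magnitudes stay comparable across temperatures (Hinton et al. 2015).
    let kl: f64 = p.iter().zip(&q).map(|(&pi, &qi)| pi * (pi / qi).ln()).sum();
    // Hard-label cross-entropy is computed at T = 1.
    let ce = -softmax_t(student_logits, 1.0)[hard_label].ln();
    alpha * t * t * kl + (1.0 - alpha) * ce
}

fn main() {
    let loss = distillation_loss(&[4.0, 1.0, 0.2], &[2.5, 0.8, 0.1], 0, 3.0, 0.7);
    println!("blended loss: {loss:.4}");
}
```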
References
- Hinton, G., Vinyals, O., and Dean, J. (2015). "Distilling the Knowledge in a Neural Network". arXiv:1503.02531.
Toyota Way Principles
- Muda Elimination: Compress models to eliminate resource waste
- Standardization: Consistent soft-target training process
Structs
- DistillationConfig - Configuration for knowledge distillation
- DistillationLoss - Knowledge distillation loss calculator
- DistillationResult - Distillation training result
- LinearDistiller - Simple linear distillation model (for testing/simple cases)
- SoftTargetGenerator - Soft target generator from logits
Constants
- DEFAULT_ALPHA - Default alpha (weight for distillation loss vs. hard-label loss)
- DEFAULT_TEMPERATURE - Default distillation temperature (recommended by review); its softening effect is sketched below
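As a rough illustration of why a temperature above 1 helps (again a sketch, not this crate's code): the value 4.0 below merely stands in for DEFAULT_TEMPERATURE, whose actual value is not shown on this page.

```rust
// Illustrative only: a temperature above 1 produces softer targets that
// expose inter-class structure. 4.0 is an assumed stand-in for
// DEFAULT_TEMPERATURE, not necessarily this crate's actual default.
fn softmax_t(logits: &[f64], t: f64) -> Vec<f64> {
    let max = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = logits.iter().map(|&z| ((z - max) / t).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

fn main() {
    let logits = [6.0, 2.0, 1.0];
    // T = 1: nearly one-hot -> ~[0.976, 0.018, 0.007]
    println!("{:?}", softmax_t(&logits, 1.0));
    // T = 4: softened distribution -> ~[0.604, 0.222, 0.173]
    println!("{:?}", softmax_t(&logits, 4.0));
}
```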
Functions
- binary_cross_entropy - Binary cross-entropy for single-class prediction
- cross_entropy - Cross-entropy loss: CE(p, y) = -sum(y * log(p))
- kl_divergence - KL divergence: D_KL(P || Q) = sum(P * log(P / Q))
- softmax - Regular softmax (T = 1)
- softmax_temperature - Softmax with temperature scaling; plain-Rust sketches of these formulas follow below
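The formulas quoted above are enough to sketch plausible plain-Rust versions of these functions; the crate's actual signatures may differ, and inputs are assumed to be valid probability distributions where the formulas require them.

```rust
/// Cross-entropy: CE(p, y) = -sum(y * log(p)).
fn cross_entropy(p: &[f64], y: &[f64]) -> f64 {
    -p.iter().zip(y).map(|(&pi, &yi)| yi * pi.ln()).sum::<f64>()
}

/// Binary cross-entropy for a single probability `p` and target `y` in {0, 1}.
fn binary_cross_entropy(p: f64, y: f64) -> f64 {
    -(y * p.ln() + (1.0 - y) * (1.0 - p).ln())
}

/// KL divergence: D_KL(P || Q) = sum(P * log(P / Q)); terms with P = 0
/// contribute nothing by convention.
fn kl_divergence(p: &[f64], q: &[f64]) -> f64 {
    p.iter()
        .zip(q)
        .map(|(&pi, &qi)| if pi > 0.0 { pi * (pi / qi).ln() } else { 0.0 })
        .sum()
}

/// Softmax with temperature scaling; max subtraction keeps it numerically stable.
fn softmax_temperature(logits: &[f64], t: f64) -> Vec<f64> {
    let max = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = logits.iter().map(|&z| ((z - max) / t).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Regular softmax is the T = 1 special case.
fn softmax(logits: &[f64]) -> Vec<f64> {
    softmax_temperature(logits, 1.0)
}

fn main() {
    let p = softmax(&[2.0, 1.0, 0.1]);
    let q = softmax_temperature(&[2.0, 1.0, 0.1], 2.0);
    println!("CE  = {:.4}", cross_entropy(&p, &[1.0, 0.0, 0.0]));
    println!("BCE = {:.4}", binary_cross_entropy(p[0], 1.0));
    println!("KL  = {:.4}", kl_divergence(&p, &q));
}
```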