data2vec — Baevski et al. 2022, ICML.
Unified self-supervised learning via teacher-student masked prediction:
- A teacher network (EMA of the student) encodes the full, unmasked input and produces target representations.
- The student encoder receives the masked input and predicts the teacher’s representations at masked positions.
- The objective is the smooth-L1 (Huber) loss between L2-normalised student predictions and L2-normalised teacher targets, averaged over masked tokens only.
θ_teacher ← m · θ_teacher + (1−m) · θ_student [EMA update]
target_j ← target_j / (‖target[:,j]‖₂ + ε) [per-dim batch norm]
L = mean huber(student_pred − target, β) [masked positions only]
Reference: “data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language”, Baevski et al., ICML 2022.
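The three update rules above can be sketched in plain Rust as follows. This is a minimal illustration, not the crate's implementation: the function names (`ema_update`, `normalize_columns`, `smooth_l1`, `masked_loss`), the `f32` element type, and the row-major `Vec<Vec<f32>>` layout are all assumptions made for the example. The sketch uses the smooth-L1 convention (quadratic branch divided by β); the classic Huber form differs from it by a factor of β, and the source does not say which one the crate uses.

```rust
/// EMA update, element-wise: teacher <- m * teacher + (1 - m) * student.
fn ema_update(teacher: &mut [f32], student: &[f32], m: f32) {
    assert_eq!(teacher.len(), student.len());
    for (t, s) in teacher.iter_mut().zip(student) {
        *t = m * *t + (1.0 - m) * *s;
    }
}

/// Divide every feature column j by (L2 norm of that column over the batch + eps).
/// `targets` holds one row per sample (assumed layout).
fn normalize_columns(targets: &mut [Vec<f32>], eps: f32) {
    if targets.is_empty() {
        return;
    }
    let dim = targets[0].len();
    for j in 0..dim {
        let norm = targets.iter().map(|row| row[j] * row[j]).sum::<f32>().sqrt();
        for row in targets.iter_mut() {
            row[j] /= norm + eps;
        }
    }
}

/// Smooth-L1 on a single residual: quadratic below beta, linear above it.
fn smooth_l1(d: f32, beta: f32) -> f32 {
    let a = d.abs();
    if a < beta {
        0.5 * d * d / beta
    } else {
        a - 0.5 * beta
    }
}

/// Mean smooth-L1 over masked positions only; returns 0.0 if nothing is masked.
/// One scalar per position keeps the example short.
fn masked_loss(pred: &[f32], target: &[f32], mask: &[bool], beta: f32) -> f32 {
    let mut sum = 0.0f32;
    let mut n = 0usize;
    for ((p, t), &is_masked) in pred.iter().zip(target).zip(mask) {
        if is_masked {
            sum += smooth_l1(p - t, beta);
            n += 1;
        }
    }
    if n == 0 { 0.0 } else { sum / n as f32 }
}
```

Unmasked positions contribute nothing to either the sum or the divisor, which is what restricts the mean to masked tokens.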
Structs
- Data2VecConfig - Hyper-parameters for the data2vec training objective.
- Data2VecResult - Output of a single data2vec loss computation.
- Data2VecState - Mutable state that tracks the teacher EMA parameter vector and training step.
Functions
- data2vec_batch_loss - Compute the mean data2vec loss over a batch of samples.
- data2vec_loss - Compute the data2vec loss for a single sample.
- data2vec_mask - Generate a boolean mask of length n_tokens with exactly floor(n_tokens × mask_ratio) positions set to true (= masked).
- huber_loss - Per-element Huber (smooth-L1) loss, averaged over all elements.
- normalize_teacher_targets - Normalise teacher representations along the batch dimension in-place.
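The mask contract described above (exactly floor(n_tokens × mask_ratio) positions set to true) can be illustrated with a partial Fisher-Yates shuffle. This is a hedged sketch only: `sketch_mask` is a hypothetical name, and the `seed` parameter and the inline xorshift generator are assumptions chosen so the example needs nothing beyond the standard library. The signature and random source of the crate's own data2vec_mask are not shown in this page.

```rust
/// Build a boolean mask with exactly floor(n_tokens * mask_ratio) `true` entries.
/// `mask_ratio` is clamped to [0, 1]; `seed` drives a small xorshift generator.
fn sketch_mask(n_tokens: usize, mask_ratio: f64, seed: u64) -> Vec<bool> {
    let n_masked = ((n_tokens as f64) * mask_ratio.clamp(0.0, 1.0)).floor() as usize;

    let mut idx: Vec<usize> = (0..n_tokens).collect();
    let mut state = seed | 1; // xorshift requires a non-zero state

    // Partial Fisher-Yates: after the loop, idx[..n_masked] holds
    // n_masked distinct positions chosen from 0..n_tokens.
    for i in 0..n_masked {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        let j = i + (state % (n_tokens - i) as u64) as usize;
        idx.swap(i, j);
    }

    let mut mask = vec![false; n_tokens];
    for &k in &idx[..n_masked] {
        mask[k] = true;
    }
    mask
}
```

Choosing positions by shuffling indices, rather than flipping an independent coin per token, is what guarantees the exact count instead of a count that only matches in expectation.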