# Transfer Learning Theory
Transfer learning leverages knowledge from one task to improve performance on related tasks, dramatically reducing data requirements and training time.
## The Transfer Learning Paradigm
```
Source Domain (Large Data)        Target Domain (Limited Data)
          │                                     │
          ▼                                     ▼
  ┌───────────────┐                     ┌───────────────┐
  │   Pre-train   │                     │   Fine-tune   │
  │  on ImageNet  │ ─────Transfer─────▶ │   on Custom   │
  │  (1M images)  │                     │  (1K images)  │
  └───────────────┘                     └───────────────┘
```
## Why Transfer Learning Works
### Feature Hierarchy
Neural networks learn hierarchical features:
| Layer depth | Features learned        | Transferability  |
|-------------|-------------------------|------------------|
| Early       | Edges, colors, textures | High (universal) |
| Middle      | Shapes, parts           | Medium           |
| Late        | Task-specific patterns  | Low              |
Early layers learn **general features** that apply across domains.
### The Lottery Ticket Hypothesis
One interpretation draws on the lottery ticket hypothesis: pre-trained networks already contain "winning tickets", well-initialized subnetworks that train effectively on new tasks. Transfer learning inherits these good initializations directly, without the expensive iterative pruning-and-retraining search otherwise needed to find such tickets.
## Transfer Strategies
### 1. Feature Extraction (Frozen Base)
```
Pre-trained Model               New Task
┌─────────────────┐            ┌────────┐
│   Base Layers   │ ─────────▶ │  New   │ ──▶ Output
│    (Frozen)     │            │  Head  │
└─────────────────┘            └────────┘
```
- Freeze pre-trained layers
- Only train new classification head
- Best when: Target data is very limited
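As a concrete illustration, here is a minimal PyTorch sketch (assumptions: torchvision's ResNet-18 as the pre-trained base and a hypothetical 10-class target task). The backbone is frozen and only the new linear head receives gradients.
```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 10  # placeholder for the target task

# Load an ImageNet-pre-trained backbone.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze every pre-trained parameter.
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head; this new layer is trainable by default.
model.fc = nn.Linear(model.fc.in_features, num_classes)

# The optimizer only needs to see the head's parameters.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```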
### 2. Fine-Tuning (Unfrozen Base)
```
Pre-trained Model               New Task
┌─────────────────┐            ┌────────┐
│   Base Layers   │ ─────────▶ │  New   │ ──▶ Output
│   (Trainable)   │            │  Head  │
└─────────────────┘            └────────┘
```
- Train the entire network with a small learning rate
- Base layers: 0.01–0.1× the nominal learning rate
- Head layers: 1.0× the nominal learning rate (trained at full rate)
- Best when: Moderate target data available
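A minimal sketch of this discriminative-learning-rate setup using optimizer parameter groups; the two-part model below is a stand-in for a real pre-trained backbone plus new head.
```python
import torch
import torch.nn as nn

# Stand-in model: `base` plays the pre-trained backbone, `head` the new classifier.
model = nn.ModuleDict({
    "base": nn.Sequential(nn.Linear(128, 64), nn.ReLU()),
    "head": nn.Linear(64, 10),
})

# Parameter groups: the backbone gets ~0.01x the head's learning rate.
optimizer = torch.optim.AdamW([
    {"params": model["base"].parameters(), "lr": 1e-5},
    {"params": model["head"].parameters(), "lr": 1e-3},
])
```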
### 3. Gradual Unfreezing
Progressive unfreezing from top to bottom:
```
Epoch 1: Train head only
Epoch 2: Unfreeze top base layer
Epoch 3: Unfreeze next layer
...
Epoch N: All layers trainable
```
Prevents catastrophic forgetting of pre-trained knowledge.
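A rough sketch of such a schedule, assuming the backbone's child modules are ordered from bottom to top and the new head is always trainable:
```python
import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

def apply_unfreeze_schedule(backbone: nn.Module, head: nn.Module, epoch: int) -> None:
    """Freeze everything, then unfreeze `epoch` layer groups counting from the top."""
    set_trainable(backbone, False)        # start fully frozen
    set_trainable(head, True)             # the new head always trains
    blocks = list(backbone.children())    # assumed ordered bottom-to-top
    for block in blocks[::-1][:epoch]:    # one more top block per epoch
        set_trainable(block, True)
```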
## Domain Adaptation
When source and target distributions differ:
### Discrepancy-Based Methods
Minimize distribution distance:
```
L = L_task + λ · MMD(source, target)
```
where MMD is the Maximum Mean Discrepancy, a kernel-based distance between the source and target feature distributions, and λ weights the alignment penalty against the task loss.
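A small sketch of an RBF-kernel estimate of MMD² between batches of source and target features (the bandwidth `sigma` and the weight `lam` are placeholder hyperparameters):
```python
import torch

def mmd_rbf(source: torch.Tensor, target: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Biased MMD^2 estimate with a single Gaussian kernel of bandwidth sigma."""
    def kernel(a, b):
        sq_dist = torch.cdist(a, b) ** 2          # pairwise squared distances
        return torch.exp(-sq_dist / (2 * sigma ** 2))

    return (kernel(source, source).mean()
            + kernel(target, target).mean()
            - 2 * kernel(source, target).mean())

# Combined objective for a batch of source/target features:
# loss = task_loss + lam * mmd_rbf(source_feats, target_feats)
```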
### Adversarial Methods (DANN)
Domain Adversarial Neural Network:
```
Features ─────────────────▶ Task Classifier     (minimize task loss)
    │
    └──[gradient reversal]─▶ Domain Classifier  (minimizes domain loss; the reversed
                                                 gradient makes the encoder maximize it)
```
Features become domain-invariant.
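The mechanism that makes this work is the gradient reversal layer: an identity map in the forward pass whose gradient is negated on the way back. A minimal PyTorch sketch, with `encoder`, `task_head`, and `domain_head` as assumed module names:
```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)                       # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # Negate the gradient flowing back into the feature extractor.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# features      = encoder(x)
# task_logits   = task_head(features)                  # ordinary task loss
# domain_logits = domain_head(grad_reverse(features))  # encoder learns to fool this head
```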
## Multi-Task Learning
Learn multiple related tasks simultaneously:
```
         Input
           │
           ▼
      ┌─────────┐
      │ Shared  │
      │ Encoder │
      └────┬────┘
           │
      ┌────┴────┐
      │         │
      ▼         ▼
   ┌──────┐  ┌──────┐
   │Task A│  │Task B│
   │ Head │  │ Head │
   └──────┘  └──────┘
```
Benefits:
- Improved generalization through regularization
- Data efficiency (shared representation)
- Training efficiency (one shared model instead of one per task)
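A minimal sketch of this hard-parameter-sharing layout: one shared encoder, two small heads, and a summed per-task loss (the dimensions and class counts are placeholders).
```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    def __init__(self, in_dim=128, hidden=64, classes_a=10, classes_b=5):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())  # shared
        self.head_a = nn.Linear(hidden, classes_a)                          # Task A head
        self.head_b = nn.Linear(hidden, classes_b)                          # Task B head

    def forward(self, x):
        z = self.encoder(x)              # one shared representation
        return self.head_a(z), self.head_b(z)

# Per-task losses on the same forward pass, optionally weighted:
# logits_a, logits_b = model(x)
# loss = F.cross_entropy(logits_a, y_a) + F.cross_entropy(logits_b, y_b)
```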
## Low-Rank Adaptation (LoRA)
Efficient fine-tuning for large models:
Instead of updating W directly:
```
W' = W + ΔW
```
Decompose update as low-rank:
```
ΔW = B × A
where B ∈ ℝ^(d×r), A ∈ ℝ^(r×k), r << min(d,k)
```
Parameters: O(r(d+k)) vs O(dk)
Example: for GPT-3 (175B parameters), LoRA leaves only a small fraction of a percent of the weights trainable.
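A compact sketch of the idea: wrap a frozen `nn.Linear` and add a trainable low-rank update B·A (rank `r` and scaling `alpha` are placeholder hyperparameters, and this is a simplification rather than the reference implementation):
```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():                 # freeze the pre-trained W
            p.requires_grad = False
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # r x k, small random init
        self.B = nn.Parameter(torch.zeros(d, r))         # d x r, zero init so ΔW starts at 0
        self.scale = alpha / r

    def forward(self, x):
        # Equivalent to using W' = W + B·A while only training A and B.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```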
## Adapter Layers
Insert small trainable modules:
```
Original Layer:  x → [Frozen Transformer] → y
With Adapter:    x → [Frozen Transformer] → h → [Adapter] → h + Adapter(h)
                                                    │
                                              Down → ReLU → Up
                                              (d→r)        (r→d)
```
Only adapters train; base model frozen.
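A minimal sketch of the bottleneck module itself, with the residual connection wrapped around the down-project/ReLU/up-project path (`d_model` and `r` are placeholders):
```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model: int, r: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, r)   # d -> r bottleneck
        self.up = nn.Linear(r, d_model)     # r -> d projection back

    def forward(self, h):
        # Residual keeps the frozen layer's output intact when the adapter output is small.
        return h + self.up(torch.relu(self.down(h)))
```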
## Knowledge Distillation
Transfer knowledge from large to small model:
```
Teacher (Large)                     Student (Small)
       │                                  │
       ▼                                  ▼
    Logits ────── KL Divergence ──────▶ Logits
                  (soft targets)          │
                                          ▼
    Labels ────────────────────────▶ Cross-Entropy
```
Loss:
```
L = α · KL(softmax(t_logits/T), softmax(s_logits/T))
+ (1-α) · CE(s_logits, labels)
```
Temperature T smooths distributions for better transfer.
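A short sketch of this loss in PyTorch. Following common practice, the KL term is scaled by T² so its gradient magnitude stays comparable to the hard-label term; α and T are placeholder hyperparameters.
```python
import torch.nn.functional as F

def distillation_loss(s_logits, t_logits, labels, T=4.0, alpha=0.7):
    # Softened student vs. teacher distributions, compared with KL divergence.
    soft = F.kl_div(
        F.log_softmax(s_logits / T, dim=-1),
        F.softmax(t_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(s_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```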
## Negative Transfer
When transfer hurts performance:
**Causes:**
- Source and target too dissimilar
- Conflicting label spaces
- Domain shift too large
**Mitigation:**
- Measure domain similarity before transfer
- Use regularization to prevent forgetting
- Selective layer transfer
## Best Practices
### 1. Choosing What to Transfer
| Target Data | Source Similarity | Strategy |
|-------------|-------------------|----------|
| Small | High | Feature extraction |
| Small | Low | Careful fine-tuning |
| Large | High | Full fine-tuning |
| Large | Low | Train from scratch |
### 2. Learning Rate Schedule
```
Head: lr = 1e-3
Upper layers: lr = 1e-4
Lower layers: lr = 1e-5
```
Discriminative fine-tuning preserves pre-trained knowledge.
### 3. Data Augmentation
Apply to target domain to increase effective data size:
- Image: rotation, flip, crop, color jitter
- Text: back-translation, synonym replacement
- Audio: time stretch, pitch shift, noise
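For the image case, a minimal torchvision pipeline might look like the following (the particular transforms and parameters are illustrative, not prescriptive):
```python
from torchvision import transforms

train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224),        # random scale + crop
    transforms.RandomHorizontalFlip(),        # left-right flip
    transforms.RandomRotation(15),            # small rotations
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
])
```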
## Applications
| Domain | Source Task | Target Task |
|--------|-------------|-------------|
| Vision | ImageNet | Medical imaging |
| NLP | Language modeling | Sentiment analysis |
| Speech | ASR pre-training | Voice commands |
| Code | Pre-training on general code | Language-specific tasks |
## References
- Yosinski, J., et al. (2014). "How transferable are features in deep neural networks?" NeurIPS.
- Hu, E. J., et al. (2021). "LoRA: Low-Rank Adaptation of Large Language Models." arXiv.
- Houlsby, N., et al. (2019). "Parameter-Efficient Transfer Learning for NLP." ICML.
- Ganin, Y., et al. (2016). "Domain-Adversarial Training of Neural Networks." JMLR.