# Gradient Tape Management Guide
## Tape Management in Optimizers
### How zero_grad() Works
All optimizers (SGD, Adam) manage the tape in the same way through the `#[derive(Optimizer)]` macro:
```rust
// Optimizer macro automatically generates zero_grad() implementation
#[derive(Optimizer)]
pub struct SGD { /* ... */ }

// Code generated by the macro:
impl Optimizer for SGD {
    fn zero_grad(&mut self, parameters: &mut [&mut Tensor]) -> HoduResult<()> {
        // 1. Zero out each parameter's gradient
        for param in parameters.iter_mut() {
            param.zero_grad()?;
        }
        // 2. Clear only the default context (context 0) tape
        hodu_core::tensor::clear_default_context_tape();
        Ok(())
    }
}
```
**Important**:
- The `#[derive(Optimizer)]` macro automatically generates the `zero_grad()` implementation
- `clear_default_context_tape()` only clears **the default context (ID: 0)**
- It does not affect custom contexts
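
To make the scoping rule concrete, here is a minimal, self-contained sketch of a tape registry keyed by context ID. It is a toy model only: `TapeRegistry`, `record`, `tape_len`, and `lengths_after_zero_grad` are hypothetical names introduced for illustration, and the real storage inside `hodu_core` may look quite different. The only behavior it borrows from the text above is that clearing the default tape leaves every other context untouched.

```rust
use std::collections::HashMap;

/// Toy stand-in for a tape registry: context ID -> recorded op names.
/// Hypothetical model, not hodu's actual data structure.
struct TapeRegistry {
    tapes: HashMap<usize, Vec<String>>,
}

impl TapeRegistry {
    fn new() -> Self {
        let mut tapes = HashMap::new();
        tapes.insert(0, Vec::new()); // the default context always exists
        TapeRegistry { tapes }
    }

    /// Record an operation on the tape of the given context.
    fn record(&mut self, ctx: usize, op: &str) {
        self.tapes.entry(ctx).or_default().push(op.to_string());
    }

    /// Mirrors the behavior described for clear_default_context_tape():
    /// only the tape of context 0 is emptied.
    fn clear_default_context_tape(&mut self) {
        if let Some(tape) = self.tapes.get_mut(&0) {
            tape.clear();
        }
    }

    /// Number of recorded operations in a context (0 if it does not exist).
    fn tape_len(&self, ctx: usize) -> usize {
        self.tapes.get(&ctx).map_or(0, |t| t.len())
    }
}

/// Returns (default tape length, custom tape length) after the clear
/// that a generated zero_grad() would perform.
fn lengths_after_zero_grad() -> (usize, usize) {
    let mut reg = TapeRegistry::new();
    reg.record(0, "matmul");
    reg.record(0, "mean");
    reg.record(1, "matmul"); // recorded in a custom context
    reg.clear_default_context_tape();
    (reg.tape_len(0), reg.tape_len(1))
}
```

In this model the default tape ends up empty while the entry recorded under context 1 survives, which is the situation the custom-context sections of this guide deal with.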
### Using Optimizer with Default Context
```rust
use hodu_nn::optimizers::SGD;
use hodu_core::tensor::{Tensor, compute_gradients};

// `input` is assumed to be an existing Tensor whose last dimension is 10
let mut optimizer = SGD::new(0.01);
let mut weight = Tensor::randn(&[10, 5], DType::F32)?.set_requires_grad(true);

for epoch in 0..100 {
    // Forward - operations are recorded in default context (0) tape
    let output = input.matmul(&weight)?;
    let loss = output.mean(&[], false)?;
    // Backward - read default context (0) tape in reverse
    compute_gradients(loss.id())?;
    // Update
    let mut params = vec![&mut weight];
    optimizer.step(&mut params)?;
    // Zero gradients + clear default context (0) tape
    optimizer.zero_grad(&mut params)?;
}
```
## Using Custom Gradient Context
### What is GradientContext?
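A `GradientContext` is a separate tape space for independent gradient computation. While it is alive, operations are recorded on its tape instead of the default one.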
```rust
use hodu_core::tensor::{GradientContext, Tensor, compute_gradients};

{
    let _ctx = GradientContext::new(); // Create new context (e.g., ID 1)
    // Operations in this scope are recorded in context 1's tape
    let x = Tensor::randn(&[2, 3], DType::F32)?.set_requires_grad(true);
    let y = x.mul_scalar(2.0)?;
    compute_gradients(y.id())?;
    // When _ctx is dropped, context 1 and its tape are automatically deleted
}
// Back to default context (0)
```
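
The scope-bound lifetime described above can be sketched without the library. The following is a self-contained toy, assuming a thread-local map of tapes; `ScopedTape`, `TAPES`, `tape_len`, and `scoped_tape_demo` are hypothetical names and are not part of hodu. It only illustrates the idea that a guard value owns a tape and that dropping the guard removes the tape entirely.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

thread_local! {
    // Toy tape storage: context ID -> recorded op names. Context 0 is the default.
    static TAPES: RefCell<HashMap<usize, Vec<&'static str>>> =
        RefCell::new(HashMap::from([(0, Vec::new())]));
}

/// Hypothetical guard that owns one tape for as long as it is alive.
struct ScopedTape {
    id: usize,
}

impl ScopedTape {
    /// Create an empty tape under the given ID.
    fn new(id: usize) -> Self {
        TAPES.with(|t| t.borrow_mut().insert(id, Vec::new()));
        ScopedTape { id }
    }

    /// Record an operation on this guard's tape.
    fn record(&self, op: &'static str) {
        TAPES.with(|t| {
            if let Some(tape) = t.borrow_mut().get_mut(&self.id) {
                tape.push(op);
            }
        });
    }
}

impl Drop for ScopedTape {
    fn drop(&mut self) {
        // Dropping the guard deletes the whole tape, not just its entries.
        TAPES.with(|t| t.borrow_mut().remove(&self.id));
    }
}

/// Length of a tape, or None if the context no longer exists.
fn tape_len(id: usize) -> Option<usize> {
    TAPES.with(|t| t.borrow().get(&id).map(|tape| tape.len()))
}

/// Returns the tape length seen inside the scope and after leaving it.
fn scoped_tape_demo() -> (Option<usize>, Option<usize>) {
    let inside;
    {
        let ctx = ScopedTape::new(1);
        ctx.record("mul_scalar");
        inside = tape_len(1);
    } // ctx is dropped here, which removes tape 1
    (inside, tape_len(1))
}
```

Inside the scope the tape holds one entry; after the scope the lookup finds nothing, while the default tape under ID 0 is never removed.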
### Important Note When Using Custom Context with Optimizer
When using a custom context, you must **manually clear the tape**:
```rust
use hodu_core::tensor::{GradientContext, compute_gradients, clear_tape};
use hodu_nn::optimizers::SGD;

let _ctx = GradientContext::new(); // Create custom context (e.g., ID 1)
let mut optimizer = SGD::new(0.01);
let mut weight = Tensor::randn(&[10, 5], DType::F32)?.set_requires_grad(true);

for epoch in 0..100 {
    // Forward - operations are recorded in context 1 tape
    let output = input.matmul(&weight)?;
    let loss = output.mean(&[], false)?;
    // Backward - read context 1 tape in reverse
    compute_gradients(loss.id())?;
    // Update
    let mut params = vec![&mut weight];
    optimizer.step(&mut params)?;
    // Zero gradients
    optimizer.zero_grad(&mut params)?;
    // Note: zero_grad() only clears default context (0),
    // so custom context (1) tape remains!
    // Manually clear the custom context tape
    clear_tape(); // Clear currently active context (1) tape
}
// When leaving scope, _ctx is dropped and context 1 itself is deleted
```
## Tape Management Functions Summary
| Function | Scope | When it is used |
| --- | --- | --- |
| `clear_default_context_tape()` | Default context (0) only | Called by Optimizer's `zero_grad()` |
| `clear_tape()` | Currently active context | Manual cleanup in custom contexts |
| `GradientContext::drop()` | Entire context | Automatically called when scope ends |
## Example 1: Training with Default Context
```rust
use hodu_nn::optimizers::Adam;
use hodu_core::tensor::{Tensor, compute_gradients};

// Using default context (0) - no GradientContext creation
let mut optimizer = Adam::new(0.001, 0.9, 0.999, 1e-8);
let mut weight = Tensor::randn(&[784, 10], DType::F32)?.set_requires_grad(true);
let mut bias = Tensor::zeros(&[10], DType::F32)?.set_requires_grad(true);

for epoch in 0..100 {
    // Forward
    let logits = input.matmul(&weight)?.add(&bias)?;
    let loss = logits.mean(&[], false)?;
    // Backward (uses default context 0 tape)
    compute_gradients(loss.id())?;
    // Update
    let mut params = vec![&mut weight, &mut bias];
    optimizer.step(&mut params)?;
    // zero_grad() automatically clears default context 0 tape
    optimizer.zero_grad(&mut params)?;
}
```
## Example 2: Independent Training with Custom Context
Training a main model and auxiliary model in separate contexts:
```rust
use hodu_core::tensor::{GradientContext, compute_gradients, clear_tape};
use hodu_nn::optimizers::{SGD, Adam};

// === Main model (default context 0) ===
let mut main_optimizer = Adam::new(0.001, 0.9, 0.999, 1e-8);
let mut main_weight = Tensor::randn(&[100, 50], DType::F32)?.set_requires_grad(true);

// === Auxiliary model (trained in a custom context) ===
let mut aux_optimizer = SGD::new(0.01);
let mut aux_weight = Tensor::randn(&[50, 10], DType::F32)?.set_requires_grad(true);

for epoch in 0..100 {
    // === Train main model (context 0) ===
    {
        // No custom context is alive here, so the main model runs in the
        // default context, independent from the auxiliary model
        let main_output = input.matmul(&main_weight)?;
        let main_loss = main_output.mean(&[], false)?;
        compute_gradients(main_loss.id())?; // Uses context 0 tape
        let mut params = vec![&mut main_weight];
        main_optimizer.step(&mut params)?;
        main_optimizer.zero_grad(&mut params)?; // Clear context 0 tape
    }
    // === Train auxiliary model (custom context) ===
    {
        // Create the context inside this block so that it is active only
        // while the auxiliary model trains
        let _aux_ctx = GradientContext::new();
        let aux_input = main_weight.detach()?; // Detached main weights as input
        let aux_output = aux_input.matmul(&aux_weight)?;
        let aux_loss = aux_output.mean(&[], false)?;
        compute_gradients(aux_loss.id())?; // Uses the custom context tape
        let mut params = vec![&mut aux_weight];
        aux_optimizer.step(&mut params)?;
        aux_optimizer.zero_grad(&mut params)?; // Warning: clears context 0 only!
        // Must manually clear the custom context tape
        clear_tape();
    } // _aux_ctx is dropped here and the custom context is deleted
}
```
## Example 3: Nested Contexts
Performing multiple independent computations in nested scopes, each with its own context:
```rust
use hodu_core::tensor::{GradientContext, compute_gradients, clear_tape};

// Main work in default context (0)
let main_x = Tensor::randn(&[2, 3], DType::F32)?.set_requires_grad(true);
let main_y = main_x.mul_scalar(2.0)?;
{
    // Context 1: First auxiliary computation
    let _ctx1 = GradientContext::new();
    let aux1_x = Tensor::randn(&[3, 4], DType::F32)?.set_requires_grad(true);
    let aux1_y = aux1_x.mul_scalar(3.0)?;
    compute_gradients(aux1_y.id())?;
    println!("Aux1 gradient: {:?}", aux1_x.grad()?);
    clear_tape(); // Clear context 1 tape
    {
        // Context 2: Second auxiliary computation (nested)
        let _ctx2 = GradientContext::new();
        let aux2_x = Tensor::randn(&[4, 5], DType::F32)?.set_requires_grad(true);
        let aux2_y = aux2_x.mul_scalar(4.0)?;
        compute_gradients(aux2_y.id())?;
        println!("Aux2 gradient: {:?}", aux2_x.grad()?);
        clear_tape(); // Clear context 2 tape
    } // _ctx2 drop -> context 2 deleted, return to context 1
} // _ctx1 drop -> context 1 deleted, return to default context (0)

// Continue main work (default context 0)
compute_gradients(main_y.id())?;
println!("Main gradient: {:?}", main_x.grad()?);
```
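
The "return to the previous context" behavior noted in the comments above can be modeled as a stack of active context IDs. The sketch below is a self-contained illustration under that assumption; `ContextGuard`, `ACTIVE`, `active_context`, and `trace_nested` are hypothetical names, and hodu's own bookkeeping is not shown in this guide and may differ.

```rust
use std::cell::RefCell;

thread_local! {
    // Stack of active context IDs; the default context 0 sits at the bottom.
    static ACTIVE: RefCell<Vec<usize>> = RefCell::new(vec![0]);
}

/// Hypothetical guard: pushes a new context on creation, pops it on drop.
struct ContextGuard;

impl ContextGuard {
    fn new() -> Self {
        ACTIVE.with(|s| {
            let mut s = s.borrow_mut();
            let id = s.len(); // in this toy model the ID equals the nesting depth
            s.push(id);
        });
        ContextGuard
    }
}

impl Drop for ContextGuard {
    fn drop(&mut self) {
        // Leaving the scope reactivates whatever context was active before.
        ACTIVE.with(|s| {
            s.borrow_mut().pop();
        });
    }
}

/// ID of the context that would currently receive recorded operations.
fn active_context() -> usize {
    ACTIVE.with(|s| *s.borrow().last().expect("default context is always present"))
}

/// Records which context is active at each point of a nested scope.
fn trace_nested() -> Vec<usize> {
    let mut seen = vec![active_context()];
    {
        let _ctx1 = ContextGuard::new();
        seen.push(active_context());
        {
            let _ctx2 = ContextGuard::new();
            seen.push(active_context());
        } // _ctx2 dropped: back to the outer custom context
        seen.push(active_context());
    } // _ctx1 dropped: back to the default context
    seen.push(active_context());
    seen
}
```

Because guards are dropped in reverse order of creation, a plain stack is enough to restore the enclosing context in this model.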
## Important Notes
### 1. Optimizer's zero_grad() Only Clears Default Context
```rust
// Incorrect usage: only calling zero_grad() in custom context
let _ctx = GradientContext::new();
// ... training loop ...
optimizer.zero_grad(&mut params)?; // Only clears default context (0)!
// Custom context tape keeps accumulating, causing memory leak
// Correct usage: additionally call clear_tape()
let _ctx = GradientContext::new();
// ... training loop ...
optimizer.zero_grad(&mut params)?;
clear_tape(); // Clear currently active context tape
```
### 2. GradientContext Follows RAII Pattern
```rust
{
    let _ctx = GradientContext::new();
    // Perform operations
} // _ctx drop -> context and tape automatically deleted
// When leaving scope, all information for that context is destroyed
```
### 3. Default Context (0) is Never Deleted
```rust
// Default context is automatically created at program start
// and persists until program termination
// Only the tape can be cleared with clear_default_context_tape()
```
## What Happens If You Forget to Clear the Tape?
```rust
let _ctx = GradientContext::new();
for epoch in 0..10000 {
    let output = model.forward(&input)?;
    let loss = output.mean(&[], false)?;
    compute_gradients(loss.id())?;
    optimizer.step(&mut params)?;
    optimizer.zero_grad(&mut params)?;
    // Mistake: not calling clear_tape()!
}
// Result: tape records all 10000 iterations of operations
// - Memory usage spikes
// - Backpropagation slows down
// - Eventually crashes due to out of memory
```
**Solution**: Call `clear_tape()` at the end of every iteration when training inside a custom context.
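
The growth pattern can be reproduced with a small self-contained simulation. This is a toy model, not hodu code: `simulate_tape` is a hypothetical function in which the tape is a plain `Vec` and each iteration records a fixed number of entries. It returns the final and peak tape length so the two strategies can be compared.

```rust
/// Toy training loop: every iteration records `ops_per_iter` entries on a tape.
/// Returns (final tape length, peak tape length).
fn simulate_tape(iters: usize, ops_per_iter: usize, clear_each_iter: bool) -> (usize, usize) {
    let mut tape: Vec<usize> = Vec::new();
    let mut peak: usize = 0;
    for step in 0..iters {
        // Forward pass: record this iteration's operations.
        for _ in 0..ops_per_iter {
            tape.push(step);
        }
        peak = peak.max(tape.len());
        // A backward pass would walk the whole tape at this point,
        // so its cost grows with tape.len().
        if clear_each_iter {
            tape.clear();
        }
    }
    (tape.len(), peak)
}
```

With 10000 iterations and three recorded operations each, the uncleared tape in this model ends with 30000 entries, while clearing every iteration keeps the peak at three.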