hodu 0.2.4 - Docs.rs

# Gradient Tape Management Guide

## Tape Management in Optimizers

### How zero_grad() Works

All optimizers (SGD, Adam) manage the tape in the same way through the `#[derive(Optimizer)]` macro:

```rust
// Optimizer macro automatically generates zero_grad() implementation
#[derive(Optimizer)]
pub struct SGD { /* ... */ }

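// Equivalent of the code generated by the macro (shown for illustration only;
// do not write this impl by hand alongside the derive):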
impl Optimizer for SGD {
    fn zero_grad(&mut self, parameters: &mut [&mut Tensor]) -> HoduResult<()> {
        // 1. Zero out each parameter's gradient
        for param in parameters.iter_mut() {
            param.zero_grad()?;
        }

        // 2. Clear only the default context (context 0) tape
        hodu_core::tensor::clear_default_context_tape();

        Ok(())
    }
}
```

**Important**:
- The `#[derive(Optimizer)]` macro automatically generates the `zero_grad()` implementation
- `clear_default_context_tape()` only clears **the default context (ID: 0)**
- It does not affect custom contexts

### Using Optimizer with Default Context

```rust
use hodu_nn::optimizers::SGD;
use hodu_core::tensor::{Tensor, compute_gradients};

// `input` is assumed to be an existing Tensor whose last dimension is 10

let mut optimizer = SGD::new(0.01);
let mut weight = Tensor::randn(&[10, 5], DType::F32)?.set_requires_grad(true);

for epoch in 0..100 {
    // Forward - operations are recorded in default context (0) tape
    let output = input.matmul(&weight)?;
    let loss = output.mean(&[], false)?;

    // Backward - read default context (0) tape in reverse
    compute_gradients(loss.id())?;

    // Update
    let mut params = vec![&mut weight];
    optimizer.step(&mut params)?;

    // Zero gradients + clear default context (0) tape
    optimizer.zero_grad(&mut params)?;
}
```

## Using Custom Gradient Context

### What is GradientContext?

A `GradientContext` is a separate tape space for independent gradient computation. It is active from the moment it is created until it is dropped.

```rust
use hodu_core::tensor::{GradientContext, compute_gradients};

{
    let _ctx = GradientContext::new();  // Create new context (e.g., ID 1)

    // Operations in this scope are recorded in context 1's tape
    let x = Tensor::randn(&[2, 3], DType::F32)?.set_requires_grad(true);
    let y = x.mul_scalar(2.0)?;
    compute_gradients(y.id())?;

    // When _ctx is dropped, context 1 and its tape are automatically deleted
}

// Back to default context (0)
```

### Important Note When Using Custom Context with Optimizer

When using a custom context, you must **manually clear the tape**, because the optimizer's `zero_grad()` only clears the default context (0):

```rust
use hodu_core::tensor::{GradientContext, compute_gradients, clear_tape};
use hodu_nn::optimizers::SGD;

let _ctx = GradientContext::new();  // Create custom context (e.g., ID 1)

let mut optimizer = SGD::new(0.01);
let mut weight = Tensor::randn(&[10, 5], DType::F32)?.set_requires_grad(true);

for epoch in 0..100 {
    // Forward - operations are recorded in context 1 tape
    let output = input.matmul(&weight)?;
    let loss = output.mean(&[], false)?;

    // Backward - read context 1 tape in reverse
    compute_gradients(loss.id())?;

    // Update
    let mut params = vec![&mut weight];
    optimizer.step(&mut params)?;

    // Zero gradients
    optimizer.zero_grad(&mut params)?;
    // Note: zero_grad() only clears default context (0),
    // so custom context (1) tape remains!

    // Manually clear the custom context tape
    clear_tape();  // Clear currently active context (1) tape
}

// When leaving scope, _ctx is dropped and context 1 itself is deleted
```

## Tape Management Functions Summary

| Function | Target | Usage |
|----------|--------|-------|
| `clear_default_context_tape()` | Default context (0) only | Called by Optimizer's `zero_grad()` |
| `clear_tape()` | Currently active context | Manual cleanup in custom contexts |
| `GradientContext::drop()` | Entire context | Automatically called when scope ends |
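
To make the difference between the three rows concrete, here is a minimal toy model in plain Rust. It is not hodu's implementation: `ToyContext`, `record`, `toy_clear_tape`, `toy_clear_default_context_tape`, and `tape_len` are hypothetical names that only mimic the behavior described in the table, assuming one tape per context ID and a stack of active contexts.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

/// Toy registry (an assumption, not hodu's internals):
/// one tape per context ID plus a stack of active context IDs.
struct Registry {
    tapes: HashMap<usize, Vec<&'static str>>,
    active: Vec<usize>, // the last element is the active context
    next_id: usize,
}

thread_local! {
    static REGISTRY: RefCell<Registry> = RefCell::new(Registry {
        tapes: HashMap::from([(0, Vec::new())]), // context 0 always exists
        active: vec![0],
        next_id: 1,
    });
}

/// RAII guard standing in for a custom gradient context.
pub struct ToyContext {
    id: usize,
}

impl ToyContext {
    pub fn new() -> Self {
        REGISTRY.with(|r| {
            let mut r = r.borrow_mut();
            let id = r.next_id;
            r.next_id += 1;
            r.tapes.insert(id, Vec::new());
            r.active.push(id);
            ToyContext { id }
        })
    }
}

impl Drop for ToyContext {
    // Dropping the guard deletes the whole context, tape included.
    fn drop(&mut self) {
        REGISTRY.with(|r| {
            let mut r = r.borrow_mut();
            r.tapes.remove(&self.id);
            r.active.retain(|&id| id != self.id);
        });
    }
}

/// Record an operation on the tape of the currently active context.
pub fn record(op: &'static str) {
    REGISTRY.with(|r| {
        let mut r = r.borrow_mut();
        let id = *r.active.last().expect("context 0 is always active");
        r.tapes.get_mut(&id).expect("active context has a tape").push(op);
    });
}

/// Clear the tape of the currently active context.
pub fn toy_clear_tape() {
    REGISTRY.with(|r| {
        let mut r = r.borrow_mut();
        let id = *r.active.last().expect("context 0 is always active");
        if let Some(tape) = r.tapes.get_mut(&id) {
            tape.clear();
        }
    });
}

/// Clear only the tape of context 0, whatever context is active.
pub fn toy_clear_default_context_tape() {
    REGISTRY.with(|r| {
        let mut r = r.borrow_mut();
        if let Some(tape) = r.tapes.get_mut(&0) {
            tape.clear();
        }
    });
}

/// Length of a context's tape, or None if the context no longer exists.
pub fn tape_len(id: usize) -> Option<usize> {
    REGISTRY.with(|r| r.borrow().tapes.get(&id).map(|t| t.len()))
}

/// Returns (default tape after default clear, custom tape after default clear,
/// custom tape after toy_clear_tape, custom tape after drop).
pub fn demo() -> (Option<usize>, Option<usize>, Option<usize>, Option<usize>) {
    record("matmul"); // recorded in context 0
    let ctx = ToyContext::new(); // the custom context becomes active
    let id = ctx.id;
    record("mul_scalar");
    record("mean");
    toy_clear_default_context_tape(); // touches context 0 only
    let default_len = tape_len(0);
    let custom_before = tape_len(id);
    toy_clear_tape(); // clears the active (custom) context
    let custom_after = tape_len(id);
    drop(ctx); // removes the custom context entirely
    (default_len, custom_before, custom_after, tape_len(id))
}
```

In this toy model, clearing the default tape leaves the custom tape untouched, which is why a custom context needs its own explicit clear.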

## Example 1: Training with Default Context

```rust
use hodu_nn::optimizers::Adam;
use hodu_core::tensor::{Tensor, compute_gradients};

// Using default context (0) - no GradientContext creation
let mut optimizer = Adam::new(0.001, 0.9, 0.999, 1e-8);
let mut weight = Tensor::randn(&[784, 10], DType::F32)?.set_requires_grad(true);
let mut bias = Tensor::zeros(&[10], DType::F32)?.set_requires_grad(true);

for epoch in 0..100 {
    // Forward
    let logits = input.matmul(&weight)?.add(&bias)?;
    let loss = logits.mean(&[], false)?;

    // Backward (uses default context 0 tape)
    compute_gradients(loss.id())?;

    // Update
    let mut params = vec![&mut weight, &mut bias];
    optimizer.step(&mut params)?;

    // zero_grad() automatically clears default context 0 tape
    optimizer.zero_grad(&mut params)?;
}
```

## Example 2: Independent Training with Custom Context

Training a main model and an auxiliary model in separate contexts:

```rust
use hodu_core::tensor::{GradientContext, compute_gradients, clear_tape};
use hodu_nn::optimizers::{SGD, Adam};

// === Main model (default context 0) ===
let mut main_optimizer = Adam::new(0.001, 0.9, 0.999, 1e-8);
let mut main_weight = Tensor::randn(&[100, 50], DType::F32)?.set_requires_grad(true);

// === Auxiliary model (trained in a custom context) ===
let mut aux_optimizer = SGD::new(0.01);
let mut aux_weight = Tensor::randn(&[50, 10], DType::F32)?.set_requires_grad(true);

for epoch in 0..100 {
    // === Train main model (context 0) ===
    {
        // No GradientContext guard is alive here, so the default context
        // is active and the main model is independent of the auxiliary model

        let main_output = input.matmul(&main_weight)?;
        let main_loss = main_output.mean(&[], false)?;

        compute_gradients(main_loss.id())?;  // Uses context 0 tape

        let mut params = vec![&mut main_weight];
        main_optimizer.step(&mut params)?;
        main_optimizer.zero_grad(&mut params)?;  // Clear context 0 tape
    }

    // === Train auxiliary model (custom context) ===
    {
        // Create the context inside this block so it is active only here
        let _aux_ctx = GradientContext::new();

        let aux_input = main_weight.detach()?;  // Detached main weight as input
        let aux_output = aux_input.matmul(&aux_weight)?;
        let aux_loss = aux_output.mean(&[], false)?;

        compute_gradients(aux_loss.id())?;  // Uses the custom context tape

        let mut params = vec![&mut aux_weight];
        aux_optimizer.step(&mut params)?;
        aux_optimizer.zero_grad(&mut params)?;  // Warning: does not clear the custom tape

        // Clear the custom context tape explicitly
        clear_tape();
    } // _aux_ctx dropped -> custom context deleted, back to default context (0)
}

// Each epoch's custom context was already deleted when its block ended
```

## Example 3: Nested Contexts

Performing multiple independent computations in nested contexts:

```rust
use hodu_core::tensor::{GradientContext, compute_gradients, clear_tape};

// Main work in default context (0)
let main_x = Tensor::randn(&[2, 3], DType::F32)?.set_requires_grad(true);
let main_y = main_x.mul_scalar(2.0)?;

{
    // Context 1: First auxiliary computation
    let _ctx1 = GradientContext::new();

    let aux1_x = Tensor::randn(&[3, 4], DType::F32)?.set_requires_grad(true);
    let aux1_y = aux1_x.mul_scalar(3.0)?;
    compute_gradients(aux1_y.id())?;
    println!("Aux1 gradient: {:?}", aux1_x.grad()?);

    clear_tape();  // Clear context 1 tape

    {
        // Context 2: Second auxiliary computation (nested)
        let _ctx2 = GradientContext::new();

        let aux2_x = Tensor::randn(&[4, 5], DType::F32)?.set_requires_grad(true);
        let aux2_y = aux2_x.mul_scalar(4.0)?;
        compute_gradients(aux2_y.id())?;
        println!("Aux2 gradient: {:?}", aux2_x.grad()?);

        clear_tape();  // Clear context 2 tape

    } // _ctx2 drop -> context 2 deleted, return to context 1

} // _ctx1 drop -> context 1 deleted, return to default context (0)

// Continue main work (default context 0)
compute_gradients(main_y.id())?;
println!("Main gradient: {:?}", main_x.grad()?);
```
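
The order in which contexts become active and inactive in Example 3 can be sketched with a small stand-alone model. This is only an illustration under the assumption that active contexts behave like a stack; `ScopeGuard`, `active_id`, and `nesting_trace` are hypothetical names and not part of hodu.

```rust
use std::cell::{Cell, RefCell};

thread_local! {
    // Stack of active context IDs; the default context 0 is never removed.
    static ACTIVE: RefCell<Vec<usize>> = RefCell::new(vec![0]);
    // Next ID to hand out to a new guard.
    static NEXT_ID: Cell<usize> = Cell::new(1);
}

/// Hypothetical RAII guard: activates a new context on creation
/// and deactivates it on drop.
pub struct ScopeGuard {
    id: usize,
}

impl ScopeGuard {
    pub fn new() -> Self {
        let id = NEXT_ID.with(|n| {
            let id = n.get();
            n.set(id + 1);
            id
        });
        ACTIVE.with(|a| a.borrow_mut().push(id));
        ScopeGuard { id }
    }
}

impl Drop for ScopeGuard {
    fn drop(&mut self) {
        ACTIVE.with(|a| a.borrow_mut().retain(|&id| id != self.id));
    }
}

/// ID of the context that would receive newly recorded operations.
pub fn active_id() -> usize {
    ACTIVE.with(|a| *a.borrow().last().expect("default context 0 is never removed"))
}

/// Records the active ID at each step of a two-level nesting.
pub fn nesting_trace() -> Vec<usize> {
    let mut trace = vec![active_id()]; // default context
    {
        let _outer = ScopeGuard::new();
        trace.push(active_id()); // outer custom context
        {
            let _inner = ScopeGuard::new();
            trace.push(active_id()); // inner custom context
        } // _inner dropped here
        trace.push(active_id()); // back to the outer context
    } // _outer dropped here
    trace.push(active_id()); // back to the default context
    trace
}
```

Called once on a fresh thread, `nesting_trace()` yields `[0, 1, 2, 1, 0]`, mirroring the "return to context 1" and "return to default context (0)" comments in the example above.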

## Important Notes

### 1. Optimizer's zero_grad() Only Clears Default Context

```rust
// Incorrect usage: only calling zero_grad() in custom context
let _ctx = GradientContext::new();
// ... training loop ...
optimizer.zero_grad(&mut params)?;  // Only clears default context (0)!
// Custom context tape keeps accumulating, so memory usage grows every iteration

// Correct usage: additionally call clear_tape()
let _ctx = GradientContext::new();
// ... training loop ...
optimizer.zero_grad(&mut params)?;
clear_tape();  // Clear currently active context tape
```

### 2. GradientContext Follows RAII Pattern

```rust
{
    let _ctx = GradientContext::new();
    // Perform operations
} // _ctx drop -> context and tape automatically deleted

// When leaving scope, all information for that context is destroyed
```

### 3. Default Context (0) is Never Deleted

```rust
// Default context is automatically created at program start
// and persists until program termination
// Only the tape can be cleared with clear_default_context_tape()
```

## What Happens If You Forget to Clear the Tape?

```rust
let _ctx = GradientContext::new();

for epoch in 0..10000 {
    let output = model.forward(&input)?;
    let loss = output.mean(&[], false)?;
    compute_gradients(loss.id())?;
    optimizer.step(&mut params)?;
    optimizer.zero_grad(&mut params)?;
    // Mistake: not calling clear_tape()!
}

// Result: tape records all 10000 iterations of operations
// - Memory usage spikes
// - Backpropagation slows down
// - Eventually crashes due to out of memory
```

**Solution**: When training in a custom context, call `clear_tape()` at the end of every iteration.
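
The growth described above can be reproduced with a toy simulation that does not use hodu at all. It assumes the tape is an append-only list that gains a fixed number of entries per training step; `simulate` and its parameters are hypothetical names chosen for this sketch.

```rust
/// Simulates a tape that receives `ops_per_step` entries per iteration.
/// Returns (final tape length, peak tape length).
pub fn simulate(epochs: usize, ops_per_step: usize, clear_each_step: bool) -> (usize, usize) {
    let mut tape: Vec<usize> = Vec::new();
    let mut peak = 0;

    for epoch in 0..epochs {
        // Forward pass: every operation appends one entry to the tape.
        for op in 0..ops_per_step {
            tape.push(epoch * ops_per_step + op);
        }
        peak = peak.max(tape.len());

        // Stand-in for calling clear_tape() at the end of the iteration.
        if clear_each_step {
            tape.clear();
        }
    }

    (tape.len(), peak)
}
```

With 10,000 iterations and 3 operations per step, the uncleared tape ends at 30,000 entries, while the cleared tape never holds more than 3.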