// ferrotorch-nn 0.6.1

//! Dropout regularization layers.
//!
//! [`Dropout`] randomly zeroes individual elements during training with
//! probability `p`, scaling surviving elements by `1/(1-p)` (inverted
//! dropout). [`Dropout1d`], [`Dropout2d`], and [`Dropout3d`] drop entire
//! channels instead of individual elements, for 3D, 4D, and 5D inputs
//! respectively. [`AlphaDropout`] preserves mean and variance for use
//! with SELU activations, and [`FeatureAlphaDropout`] applies the same
//! alpha affine per channel.
//!
//! All six CPU forward paths draw their keep-mask from the byte-exact
//! MT19937 `Generator` (`ferrotorch_core::rng`), consuming random numbers in
//! the same order as torch: one draw per element ([`Dropout`],
//! [`AlphaDropout`]) or one draw per `[N, C]` channel
//! ([`Dropout1d`]/[`Dropout2d`]/[`Dropout3d`], [`FeatureAlphaDropout`]), in
//! flat order, keeping an entry iff `next_uniform_f64() < (1 - p)`. As a
//! result, `ferrotorch_core::manual_seed(s)` reproduces
//! `torch.manual_seed(s); F.dropout{,1d,2d,3d}` / `nn.AlphaDropout` /
//! `nn.FeatureAlphaDropout` byte-for-byte (#1634, #1635, #1636). The alpha
//! variants use torch's hardcoded `alpha = 1.7580993408473766` affine
//! (`aten/src/ATen/native/Dropout.cpp:76`).
//!
//! All modules are identity in eval mode and have zero learnable parameters.
//!
//! ## REQ status (per `.design/ferrotorch-nn/dropout.md`)
//!
//! | REQ | Status | Evidence |
//! |---|---|---|
//! | REQ-1 | SHIPPED | impl: `pub struct Dropout<T: Float>` here with `p` / `training` fields + ctor rejecting `p` outside `[0,1)`; non-test consumer: `Dropout::<T>::new(0.5)?` invoked in `ferrotorch-vision/src/models/vgg.rs` (the VGG classifier head dropout). |
//! | REQ-2 | SHIPPED | impl: `<Dropout as Module>::forward` body with eval / `p==0` short-circuit + Bernoulli + scale here; non-test consumer: `Dropout::forward` is called on every forward pass through the VGG / Inception classifier (constructed in `vgg.rs` and `inception.rs`). |
//! | REQ-3 | SHIPPED | impl: `input.is_cuda() && backend = ferrotorch_core::gpu_dispatch::gpu_backend()` GPU branch inside `<Dropout as Module>::forward` here; non-test consumer: any vision model run on CUDA (e.g. VGG / Inception fine-tuning with parameters on GPU) triggers this on every forward step. |
//! | REQ-4 | SHIPPED | impl: `struct DropoutBackward<T>` + `GradFn` impl here; non-test consumer: every `loss.backward()` over a model containing `Dropout` traverses these nodes via the autograd engine. |
//! | REQ-5 | SHIPPED | impl: `pub struct Dropout2d<T: Float>` + `Module` impl here; per-channel keep-mask drawn from the byte-exact MT19937 `Generator` (`make_feature_noise(input).bernoulli_(1-p)`, `Dropout.cpp:73-74`, keep iff `u < 1-p`), reproducing `torch.manual_seed(s); F.dropout2d` byte-for-byte (#1635, pinned by `divergence_dropout_seed_extended_and_feature_1634.rs::dropout2d_seed42_per_channel_matches_torch` vs live torch 2.11); non-test consumer: `pub use dropout::Dropout2d` in `lib.rs` exposes for downstream vision / segmentation code. |
//! | REQ-6 | SHIPPED | impl: `pub struct Dropout1d<T: Float>` + `Module` impl here; per-channel MT19937 mask (#1635, pinned by `dropout1d_seed42_per_channel_matches_torch`); non-test consumer: `pub use dropout::Dropout1d` in `lib.rs`. |
//! | REQ-7 | SHIPPED | impl: `pub struct Dropout3d<T: Float>` + `Module` impl here; per-channel MT19937 mask (#1635, pinned by `dropout3d_seed42_per_channel_matches_torch`); non-test consumer: `pub use dropout::Dropout3d` in `lib.rs`. |
//! | REQ-8 | SHIPPED | impl: `struct Dropout2dBackward<T>` + `GradFn` impl here; non-test consumer: autograd engine traversal on any model using `Dropout2d` in training. |
//! | REQ-9 | SHIPPED | impl: `pub struct AlphaDropout<T: Float>` + torch's EXACT alpha affine inside `<AlphaDropout as Module>::forward` here — per-element MT19937 keep-mask (keep iff `u < 1-p`) + `alpha = 1.7580993408473766` (`ALPHA_DROPOUT_ALPHA`, torch's hardcoded literal at `Dropout.cpp:76`, NOT recomputed `SELU_LAMBDA*SELU_ALPHA`), `a = 1/sqrt((alpha^2*p+1)*(1-p))`, kept = `a*x+alpha*a*p`, dropped = `-alpha*a+alpha*a*p` (`Dropout.cpp:74-79`), reproducing `torch.manual_seed(s); nn.AlphaDropout(p)` byte-for-byte (#1636, pinned by `divergence_dropout_seed_extended_and_feature_1634.rs::alpha_dropout_seed42_matches_torch` vs live torch 2.11); non-test consumer: `pub use dropout::AlphaDropout` in `lib.rs`. |
//! | REQ-10 | SHIPPED | impl: `struct AlphaDropoutBackward<T>` + `GradFn` impl here; non-test consumer: autograd engine traversal on models using `AlphaDropout`. |
//! | REQ-11 | SHIPPED | impl: 6 `Module<T> for <DropoutKind><T>` impl blocks here (one per dropout struct), each returning `vec![]` for parameters; non-test consumer: `ferrotorch_optim::Optimizer` walks `Module::parameters_mut()` of containers; dropout returns an empty list (correct: dropout has no trainable parameters). |
//! | REQ-12 | SHIPPED | impl: `with_inplace` builder + `inplace` getter + `inplace` field on all six dropout structs, the autograd-safe `apply_inplace_dropout` helper (errors on grad-requiring leaf per torch `VariableTypeUtils.h:80-84`; out-of-place fallback on grad-requiring non-leaf — R-DEV-7, ferrotorch lacks torch's version counter `saved_variable.cpp:170-186`; raw `write_inplace`/`Tensor::update_data` only on the non-grad-tracked path), and the `if self.inplace { apply_inplace_dropout(input, &output_data)? }` branch in `<Dropout/Dropout1d/Dropout2d/Dropout3d as Module>::forward` here, mirroring `_VF.dropout_`/`_VF.feature_dropout_` at `torch/nn/functional.py:1449,1516,1579,1629` on the memory-opt path; `AlphaDropout`/`FeatureAlphaDropout` carry the field for ABI parity but match torch's module forward which never forwards `inplace` (`dropout.py:265-269,319-323`). Non-test production consumer: the `if self.inplace` branch is on the live forward path of `<Dropout as Module>::forward` here, exercised by `ferrotorch-nn/src/lora.rs` (LoRA input dropout), `ferrotorch-vision/src/models/vgg.rs` / `inception.rs` (classifier head), and `ferrotorch-graph/src/gcn.rs` (inter-layer dropout). Default `inplace=false` preserves existing behavior. Closes #1446, #1580, #1581. |
//! | REQ-13 | SHIPPED | impl: `pub struct FeatureAlphaDropout<T: Float>` + `FeatureAlphaDropoutBackward<T>` + `Module<T>` impl here — per-channel MT19937 keep-mask (`make_feature_noise` flat `[N,C]` Bernoulli, keep iff `u < 1-p`) broadcast over `[N, C, *]`, torch's EXACT alpha affine (`alpha = 1.7580993408473766`, kept = `a*x+alpha*a*p`, dropped = `-alpha*a+alpha*a*p`, `Dropout.cpp:73-79`), reproducing `torch.manual_seed(s); nn.FeatureAlphaDropout(p)` byte-for-byte (#1636, pinned by `divergence_dropout_seed_extended_and_feature_1634.rs::feature_alpha_dropout_seed42_matches_torch` vs live torch 2.11); closes #1448; non-test consumer: `pub use dropout::FeatureAlphaDropout` in `lib.rs` (re-export) exposes the layer to downstream self-normalising-network model code in `ferrotorch-vision` / `ferrotorch-llama`. |
//! | REQ-14 | NOT-STARTED | blocker #1441 (umbrella) — `Dropout2d` / `Dropout1d` / `Dropout3d` GPU forward absent (CUDA inputs return `NotImplementedOnCuda`). Parity-sweep runner arms also absent. |

use std::sync::Arc;

use ferrotorch_core::autograd::no_grad::is_grad_enabled;
use ferrotorch_core::gpu_dispatch::GpuRngState;
use ferrotorch_core::tensor::GradFn;
use ferrotorch_core::{FerrotorchError, FerrotorchResult, Float, Tensor, TensorStorage};

use crate::module::Module;
use crate::parameter::Parameter;
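
// ---------------------------------------------------------------------------
// Illustrative sketch: alpha-dropout affine constants
// ---------------------------------------------------------------------------
// A minimal, self-contained sketch of the alpha affine quoted in the module
// docs above (`a = 1/sqrt((alpha^2*p+1)*(1-p))`, kept = `a*x + alpha*a*p`,
// dropped = `-alpha*a + alpha*a*p`). The names `sketch_alpha_affine` and
// `sketch_alpha_moments` are hypothetical and are not called by any forward
// path. They only check, under the assumption of a zero-mean, unit-variance
// input, that the output keeps zero mean and a unit second moment.

/// Returns `(a, kept_offset, dropped_value)` for drop probability `p`.
#[allow(dead_code)]
fn sketch_alpha_affine(p: f64) -> (f64, f64, f64) {
    // Literal quoted in the module docs; restated here for illustration only.
    const ALPHA: f64 = 1.7580993408473766;
    let a = 1.0 / ((ALPHA * ALPHA * p + 1.0) * (1.0 - p)).sqrt();
    let kept_offset = ALPHA * a * p;
    let dropped_value = -ALPHA * a + ALPHA * a * p;
    (a, kept_offset, dropped_value)
}

/// Mean and second moment of the output, assuming `E[x] = 0` and `E[x^2] = 1`.
#[allow(dead_code)]
fn sketch_alpha_moments(p: f64) -> (f64, f64) {
    let (a, b, d) = sketch_alpha_affine(p);
    // Kept with probability `1-p`: y = a*x + b, so E[y] = b, E[y^2] = a^2 + b^2.
    // Dropped with probability `p`: y = d (a constant).
    let mean = (1.0 - p) * b + p * d;
    let second_moment = (1.0 - p) * (a * a + b * b) + p * d * d;
    (mean, second_moment)
}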

// ---------------------------------------------------------------------------
// Philox 4x32-10 for CPU-side mask regeneration
// ---------------------------------------------------------------------------
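// Backward for GPU tensors needs the dropout mask that the forward kernel
// generated on device from the Philox state, so the mask is regenerated on
// the CPU. The Philox round functions below are a copy of the core algorithm
// from ferrotorch-gpu/src/rng.rs, duplicated to avoid a dependency on the GPU
// crate. They are currently unused (hence `#[allow(dead_code)]`):
// `philox_dropout_mask` derives the mask from the saved RNG state with a
// xorshift-multiply hash instead.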

#[allow(dead_code)]
const PHILOX_M0: u32 = 0xD2511F53;
#[allow(dead_code)]
const PHILOX_M1: u32 = 0xCD9E8D57;
#[allow(dead_code)]
const PHILOX_W0: u32 = 0x9E3779B9;
#[allow(dead_code)]
const PHILOX_W1: u32 = 0xBB67AE85;

#[allow(dead_code)]
#[inline]
fn philox_round(c0: u32, c1: u32, c2: u32, c3: u32, k0: u32, k1: u32) -> (u32, u32, u32, u32) {
    let prod0 = (PHILOX_M0 as u64) * (c0 as u64);
    let hi0 = (prod0 >> 32) as u32;
    let lo0 = prod0 as u32;

    let prod1 = (PHILOX_M1 as u64) * (c2 as u64);
    let hi1 = (prod1 >> 32) as u32;
    let lo1 = prod1 as u32;

    let new_c0 = hi1 ^ c1 ^ k0;
    let new_c1 = lo1;
    let new_c2 = hi0 ^ c3 ^ k1;
    let new_c3 = lo0;

    (new_c0, new_c1, new_c2, new_c3)
}

/// Philox 4x32-10: produces 4 uniform u32 values from (counter, key).
#[allow(dead_code)]
fn philox_4x32_10(counter: u64, key: u64) -> [u32; 4] {
    let mut c0 = counter as u32;
    let mut c1 = (counter >> 32) as u32;
    let mut c2 = 0u32;
    let mut c3 = 0u32;

    let mut k0 = key as u32;
    let mut k1 = (key >> 32) as u32;

    for _ in 0..9 {
        (c0, c1, c2, c3) = philox_round(c0, c1, c2, c3, k0, k1);
        k0 = k0.wrapping_add(PHILOX_W0);
        k1 = k1.wrapping_add(PHILOX_W1);
    }
    // Round 10 (final, no key advance)
    (c0, c1, c2, c3) = philox_round(c0, c1, c2, c3, k0, k1);

    [c0, c1, c2, c3]
}

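/// Regenerate a dropout mask on the CPU from the saved Philox RNG state,
/// matching the GPU kernel's behavior. Despite the name, this does not run the
/// Philox rounds: it derives a `u32` seed as `(counter ^ seed)` and applies the
/// same xorshift-multiply hash that the GPU dropout kernel uses. Element `i`
/// is dropped (mask entry `0`) iff its hashed value is below `threshold`;
/// otherwise the mask entry is `scale`.
///
/// This keeps the backward mask identical to the forward mask generated on
/// the GPU.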
fn philox_dropout_mask<T: Float>(
    numel: usize,
    threshold: u32,
    scale: T,
    rng_state: &GpuRngState,
) -> Vec<T> {
    let zero = <T as num_traits::Zero>::zero();
    let derived_seed = (rng_state.counter() ^ rng_state.seed()) as u32;

    (0..numel)
        .map(|i| {
            let mut r = (i as u32).wrapping_mul(2654435761) ^ derived_seed;
            r ^= r << 13;
            r ^= r >> 17;
            r ^= r << 5;
            if r < threshold { zero } else { scale }
        })
        .collect()
}
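
// Illustrative sketch (hypothetical helper names, not called by the forward
// path): how the `u32` drop threshold relates to `p`, and why regenerating the
// mask from the same derived seed is deterministic. The hash restates the one
// in `philox_dropout_mask` with plain `bool`s so it can be read in isolation.

/// Drop threshold as computed in the GPU branch: `p * u32::MAX`, truncated.
#[allow(dead_code)]
fn sketch_drop_threshold(p: f64) -> u32 {
    (p * u32::MAX as f64) as u32
}

/// `true` means "keep". Element `i` is dropped iff its hashed value is below
/// `threshold`, so the result depends only on `(i, threshold, derived_seed)`.
#[allow(dead_code)]
fn sketch_keep_mask(numel: usize, threshold: u32, derived_seed: u32) -> Vec<bool> {
    (0..numel)
        .map(|i| {
            let mut r = (i as u32).wrapping_mul(2_654_435_761) ^ derived_seed;
            r ^= r << 13;
            r ^= r >> 17;
            r ^= r << 5;
            r >= threshold
        })
        .collect()
}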

// ---------------------------------------------------------------------------
// In-place storage write
// ---------------------------------------------------------------------------

/// Whether the in-place dropout policy actually mutated the input storage, or
/// suppressed the mutation for autograd safety. See [`apply_inplace_dropout`].
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InplaceOutcome {
    /// The input storage was mutated in place (`inplace=true` honored).
    Mutated,
    /// The mutation was suppressed for autograd safety; the caller must build
    /// the output out-of-place from the freshly-allocated `output_data` buffer.
    FellBackToOutOfPlace,
}

/// Apply the in-place dropout policy, mutating `input`'s storage where it is
/// autograd-safe to do so and matching torch's observable error contract where
/// ferrotorch can.
///
/// # Autograd safety (R-DEV-7 deviation — documented)
///
/// torch enforces in-place autograd correctness with two mechanisms that
/// ferrotorch's autograd engine does NOT have:
///
/// 1. A **leaf in-place guard** — mutating a leaf that requires grad raises
///    `"a leaf Variable that requires grad is being used in an in-place
///    operation."` from `check_inplace`
///    (`torch/csrc/autograd/VariableTypeUtils.h:61-63,80-84`).
/// 2. A **version counter** — every saved tensor records the storage version it
///    was saved at; if an in-place op bumps that version before backward,
///    `SavedVariable::unpack` raises `"one of the variables needed for gradient
///    computation has been modified by an inplace operation"`
///    (`torch/csrc/autograd/saved_variable.cpp:170-186`).
///
/// ferrotorch has neither (no `version` field on `TensorInner`; `Tensor::clone`
/// shares the `Arc<TensorInner>` storage). Without a version counter it cannot
/// *detect* that another backward node saved the pre-mutation storage, so an
/// unconditional in-place write silently corrupts that branch's gradient
/// (#1580). To eliminate the corruption rather than risk it, this helper adopts
/// a conservative policy on the grad-tracked path:
///
/// * **Leaf requiring grad, grad enabled** → return an `Err` mirroring torch's
///   leaf-guard message. (Matches torch exactly; pins #1581.)
/// * **Non-leaf requiring grad, grad enabled** → do NOT mutate; signal
///   [`InplaceOutcome::FellBackToOutOfPlace`] so the caller builds a fresh
///   output. The result tensor is numerically identical and the gradient is
///   correct (no shared-storage corruption); this is *more permissive* than
///   torch's version-counter `RuntimeError` — ferrotorch cannot prove the
///   storage is unused by another backward without a version counter, so it
///   declines to mutate instead of erroring. (Eliminates #1580's corruption.)
/// * **Grad disabled, or input does not require grad** → mutate in place. This
///   is the real memory-optimization case; no autograd node observes the
///   storage, so it is graph-safe and matches torch's `_VF.dropout_`.
///
/// The deviation preserves torch's *observable result* (identical output,
/// correct gradient) while declining to replicate torch's runtime error on the
/// non-leaf path, because ferrotorch lacks the version-counter infrastructure
/// that error depends on.
fn apply_inplace_dropout<T: Float>(
    input: &Tensor<T>,
    new_data: &[T],
) -> FerrotorchResult<InplaceOutcome> {
    if is_grad_enabled() && input.requires_grad() {
        if input.is_leaf() {
            // Match torch's leaf in-place guard
            // (`torch/csrc/autograd/VariableTypeUtils.h:80-84`).
            return Err(FerrotorchError::InvalidArgument {
                message:
                    "a leaf Variable that requires grad is being used in an in-place operation."
                        .to_string(),
            });
        }
        // Non-leaf requiring grad: ferrotorch has no version counter to prove
        // the shared storage is unused by another saved-for-backward node, so
        // fall back to out-of-place rather than risk corrupting that branch's
        // gradient (#1580). The caller builds the output from `new_data`.
        return Ok(InplaceOutcome::FellBackToOutOfPlace);
    }

    // Grad disabled or input does not require grad: the genuine
    // memory-optimization case. No autograd node can observe the storage, so
    // the in-place write is graph-safe and matches torch's `_VF.dropout_`.
    write_inplace(input, new_data)?;
    Ok(InplaceOutcome::Mutated)
}
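
// Illustrative sketch (hypothetical names, not used by the forward path): the
// three-way policy documented on `apply_inplace_dropout`, restated as a pure
// function of the three flags it inspects so the cases can be read at a glance.

#[allow(dead_code)]
#[derive(Debug, PartialEq, Eq)]
enum SketchInplaceDecision {
    /// Grad-requiring leaf with grad enabled: reject, as torch's leaf guard does.
    RejectLeaf,
    /// Grad-requiring non-leaf with grad enabled: do not mutate the input.
    FallBackToOutOfPlace,
    /// Grad disabled, or input does not require grad: safe to mutate.
    Mutate,
}

#[allow(dead_code)]
fn sketch_inplace_decision(
    grad_enabled: bool,
    requires_grad: bool,
    is_leaf: bool,
) -> SketchInplaceDecision {
    match (grad_enabled && requires_grad, is_leaf) {
        (true, true) => SketchInplaceDecision::RejectLeaf,
        (true, false) => SketchInplaceDecision::FallBackToOutOfPlace,
        (false, _) => SketchInplaceDecision::Mutate,
    }
}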

/// Write `new_data` over `input`'s storage in place, mirroring torch's
/// `_VF.dropout_` family (`torch/nn/functional.py:1449,1516,1579,1629`)
/// which mutate the input tensor's buffer rather than allocating a fresh
/// output.
///
/// This is the raw write; the autograd-safety policy that decides *whether* a
/// write is permitted lives in [`apply_inplace_dropout`]. Callers must route
/// through that helper and never call this directly on a grad-tracked path.
fn write_inplace<T: Float>(input: &Tensor<T>, new_data: &[T]) -> FerrotorchResult<()> {
    // SAFETY: `update_data` requires exclusive access to the input's storage
    // for the duration of the write. The dropout forward holds the only live
    // borrow of the input data (consumed into `new_data` by the caller before
    // this call). The autograd-safety policy in `apply_inplace_dropout`
    // guarantees this is only reached when grad is disabled or the input does
    // not require grad, so no backward node has saved (and could later read) a
    // version of this storage. `new_data.len() == input.numel()` is guaranteed
    // by the callers (the mask and input share numel). PyTorch performs this
    // exact mutation in `_VF.dropout_` (`torch/nn/functional.py:1449`).
    #[allow(
        clippy::undocumented_unsafe_blocks,
        reason = "SAFETY comment above documents the exclusive-access invariant; apply_inplace_dropout gates this to the non-grad-tracked path where no backward node observes the storage"
    )]
    unsafe {
        input.update_data(new_data)?;
    }
    Ok(())
}

// ---------------------------------------------------------------------------
// DropoutBackward
// ---------------------------------------------------------------------------

/// Backward node for elementwise dropout.
///
/// Reapplies the same binary mask scaled by `1/(1-p)` to the upstream
/// gradient, routing gradients only through surviving elements.
///
/// The mask is stored as a [`Tensor<T>`] on the same device as the
/// forward input so backward reduces to a single `mul` that stays
/// GPU-native when the input is on CUDA.
#[derive(Debug)]
struct DropoutBackward<T: Float> {
    input: Tensor<T>,
    /// Mask tensor with elements in `{0, 1/(1-p)}`. Lives on the same
    /// device as `input`, so `mul(grad_output, scaled_mask)` in the
    /// backward routes entirely through GPU ops when training on CUDA.
    scaled_mask: Tensor<T>,
}

impl<T: Float> GradFn<T> for DropoutBackward<T> {
    fn backward(&self, grad_output: &Tensor<T>) -> FerrotorchResult<Vec<Option<Tensor<T>>>> {
        let da = if self.input.requires_grad() {
            let g = ferrotorch_core::grad_fns::arithmetic::mul(grad_output, &self.scaled_mask)?;
            Some(g)
        } else {
            None
        };
        Ok(vec![da])
    }

    fn inputs(&self) -> Vec<&Tensor<T>> {
        vec![&self.input]
    }

    fn name(&self) -> &'static str {
        "DropoutBackward"
    }
}
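
// Illustrative sketch (hypothetical names, plain `f64` slices instead of
// `Tensor<T>`): inverted dropout multiplies the input by a mask with entries
// in `{0, 1/(1-p)}`, and the backward pass multiplies the upstream gradient by
// that same mask.

/// Builds the scaled mask from keep flags: kept -> `1/(1-p)`, dropped -> `0`.
#[allow(dead_code)]
fn sketch_scaled_mask(keep: &[bool], p: f64) -> Vec<f64> {
    let scale = 1.0 / (1.0 - p);
    keep.iter().map(|&k| if k { scale } else { 0.0 }).collect()
}

/// Elementwise product, used for both forward (`x * mask`) and backward
/// (`grad_output * mask`).
#[allow(dead_code)]
fn sketch_apply_mask(values: &[f64], mask: &[f64]) -> Vec<f64> {
    values.iter().zip(mask).map(|(&v, &m)| v * m).collect()
}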

// ---------------------------------------------------------------------------
// Dropout2dBackward
// ---------------------------------------------------------------------------

/// Backward node for channel-wise dropout.
///
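/// Applies the same rule as [`DropoutBackward`] (upstream gradient times the
/// scaled mask). The mask is already expanded to one entry per element, so it
/// encodes the channel-level structure: every spatial position in a dropped
/// channel is 0. Unlike [`DropoutBackward`], the mask is stored as a host
/// `Vec<T>` and the backward is CPU-only; a CUDA `grad_output` returns
/// `NotImplementedOnCuda`.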
#[derive(Debug)]
struct Dropout2dBackward<T: Float> {
    input: Tensor<T>,
    scaled_mask: Vec<T>,
}

impl<T: Float> GradFn<T> for Dropout2dBackward<T> {
    fn backward(&self, grad_output: &Tensor<T>) -> FerrotorchResult<Vec<Option<Tensor<T>>>> {
        if grad_output.is_cuda() {
            return Err(FerrotorchError::NotImplementedOnCuda {
                op: "dropout2d backward",
            });
        }
        let da = if self.input.requires_grad() {
            let go_data = grad_output.data_vec()?;
            let grad_a: Vec<T> = go_data
                .iter()
                .zip(self.scaled_mask.iter())
                .map(|(&g, &m)| g * m)
                .collect();
            let g = Tensor::from_storage(
                TensorStorage::cpu(grad_a),
                self.input.shape().to_vec(),
                false,
            )?;
            Some(g)
        } else {
            None
        };
        Ok(vec![da])
    }

    fn inputs(&self) -> Vec<&Tensor<T>> {
        vec![&self.input]
    }

    fn name(&self) -> &'static str {
        "Dropout2dBackward"
    }
}
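
// Illustrative sketch (hypothetical name, `f64` instead of `T`): expanding a
// per-`[N, C]` keep mask into a per-element scaled mask, as the channel-wise
// forward does before the elementwise multiply. Each channel flag is repeated
// `spatial` times, which assumes a contiguous `[N, C, *]` layout.

#[allow(dead_code)]
fn sketch_expand_channel_mask(channel_keep: &[bool], spatial: usize, p: f64) -> Vec<f64> {
    let scale = 1.0 / (1.0 - p);
    channel_keep
        .iter()
        .flat_map(|&keep| std::iter::repeat(if keep { scale } else { 0.0 }).take(spatial))
        .collect()
}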

// ===========================================================================
// Dropout
// ===========================================================================

/// Randomly zeroes elements with probability `p` during training.
///
/// During training, each element is independently set to zero with probability
/// `p` and scaled by `1/(1-p)` so that the expected value is preserved
/// (inverted dropout).  During evaluation (`eval()` mode), the input is
/// returned unchanged.
///
/// # Errors
///
/// [`Dropout::new`] returns an error if `p` is outside `[0, 1)`.
#[derive(Debug)]
pub struct Dropout<T: Float> {
    p: f64,
    training: bool,
    /// When `true`, the forward mutates the input tensor's storage in place
    /// (mask + scale written back over the input) instead of allocating a
    /// fresh output buffer. Mirrors `_DropoutNd.inplace` at
    /// `torch/nn/modules/dropout.py:29` and the `inplace` branch of
    /// `F.dropout` at `torch/nn/functional.py:1448-1450`
    /// (`_VF.dropout_(input, p, training) if inplace`).
    inplace: bool,
    _marker: std::marker::PhantomData<T>,
}

impl<T: Float> Dropout<T> {
    /// Create a new `Dropout` layer.
    ///
    /// `p` is the probability of an element being zeroed. Must be in `[0, 1)`.
    pub fn new(p: f64) -> FerrotorchResult<Self> {
        if !(0.0..1.0).contains(&p) {
            return Err(FerrotorchError::InvalidArgument {
                message: format!("dropout probability must be in [0, 1), got {p}"),
            });
        }
        Ok(Self {
            p,
            training: true,
            inplace: false,
            _marker: std::marker::PhantomData,
        })
    }

    /// Set the `inplace` flag, mirroring `torch.nn.Dropout(p, inplace=...)`
    /// at `torch/nn/modules/dropout.py:22-29`. When `true`, training-mode
    /// forward mutates the input storage instead of allocating a new buffer.
    #[must_use]
    pub fn with_inplace(mut self, inplace: bool) -> Self {
        self.inplace = inplace;
        self
    }

    /// Returns the `inplace` flag.
    pub fn inplace(&self) -> bool {
        self.inplace
    }
}

impl<T: Float> Module<T> for Dropout<T> {
    fn forward(&self, input: &Tensor<T>) -> FerrotorchResult<Tensor<T>> {
        // Eval mode or p == 0: identity.
        if !self.training || self.p == 0.0 {
            return Ok(input.clone());
        }

        let numel = input.numel();
        let scale = T::from(1.0 / (1.0 - self.p)).unwrap();
        let zero = <T as num_traits::Zero>::zero();

        // GPU fast path: run dropout kernel entirely on device using the
        // Philox CBRNG. This integrates with the global GPU RNG state so
        // that gradient checkpointing can reproduce identical masks.
        if input.is_cuda() {
            if let Some(backend) = ferrotorch_core::gpu_dispatch::gpu_backend() {
                let threshold = (self.p * u32::MAX as f64) as u32;
                let scale_f32 = 1.0f32 / (1.0 - self.p as f32);

                let (handle, rng_state) =
                    backend.dropout_philox_f32(input.gpu_handle()?, threshold, scale_f32)?;

                // For backward, we need the mask. Regenerate it from the saved
                // Philox RNG state using the same deterministic hash that the
                // GPU kernel uses. This is reproducible across checkpoint
                // save/restore because the Philox state is deterministic.
                if is_grad_enabled() && input.requires_grad() {
                    let scaled_mask_vec = philox_dropout_mask(numel, threshold, scale, &rng_state);
                    // Upload the mask to the input's device so the
                    // backward `mul` runs on-device without a CPU
                    // round-trip.
                    let mask_cpu = Tensor::from_storage(
                        TensorStorage::cpu(scaled_mask_vec),
                        input.shape().to_vec(),
                        false,
                    )?;
                    let scaled_mask = mask_cpu.to(input.device())?;
                    return Tensor::from_operation(
                        TensorStorage::gpu(handle),
                        input.shape().to_vec(),
                        Arc::new(DropoutBackward {
                            input: input.clone(),
                            scaled_mask,
                        }),
                    );
                } else {
                    return Tensor::from_storage(
                        TensorStorage::gpu(handle),
                        input.shape().to_vec(),
                        false,
                    );
                }
            }
        }

        // CPU path — draw the keep-mask from the byte-exact MT19937
        // `Generator` (`ferrotorch_core::rng`) using torch's EXACT CPU dropout
        // consumption, so `ferrotorch_core::manual_seed(s); Dropout::forward`
        // reproduces `torch.manual_seed(s); F.dropout(...)` byte-for-byte
        // (#1634). torch draws the mask via `noise.bernoulli_(1 - p)`
        // (`aten/src/ATen/native/Dropout.cpp:74`); the scalar bernoulli kernel
        // (`aten/src/ATen/native/cpu/DistributionTemplates.h:388-399`)
        // evaluates per element in flat order
        // `transformation::bernoulli<double>(uniform_real<double>(gen->random64(), 0, 1), 1 - p)`
        // = `uniform64 < (1 - p)` (keep == 1)
        // (`DistributionsHelper.h:107-113,219-222`,
        // `TransformationHelper.h:84-89,171-173`).
        // `uniform_real<double>(random64(), 0, 1)` is exactly
        // `Generator::next_uniform_f64` (rng.rs REQ-5, byte-exact); survivors
        // are scaled by `1/(1-p)` (`Dropout.cpp:81` `noise.div_(1 - p)`).
        let keep_prob = 1.0 - self.p;
        let scaled_mask_vec: Vec<T> = ferrotorch_core::rng::with_thread_rng(|g| {
            (0..numel)
                .map(|_| {
                    if g.next_uniform_f64() < keep_prob {
                        scale
                    } else {
                        zero
                    }
                })
                .collect()
        });

        let input_data = input.data()?;
        let output_data: Vec<T> = input_data
            .iter()
            .zip(scaled_mask_vec.iter())
            .map(|(&x, &m)| x * m)
            .collect();

        // In-place branch, mirroring `_VF.dropout_(input, p, training)` at
        // `torch/nn/functional.py:1449`. `apply_inplace_dropout` applies the
        // autograd-safe policy: it errors on a grad-requiring leaf (matching
        // torch), falls back to out-of-place for a grad-requiring non-leaf
        // (ferrotorch lacks torch's version counter, so it declines to mutate
        // shared storage), and mutates in place only when no autograd node can
        // observe the storage. The out-of-place output below is always built
        // from `output_data`, so the fallback needs no special handling here.
        if self.inplace {
            apply_inplace_dropout(input, &output_data)?;
        }

        if is_grad_enabled() && input.requires_grad() {
            let scaled_mask = Tensor::from_storage(
                TensorStorage::cpu(scaled_mask_vec),
                input.shape().to_vec(),
                false,
            )?;
            Tensor::from_operation(
                TensorStorage::cpu(output_data),
                input.shape().to_vec(),
                Arc::new(DropoutBackward {
                    input: input.clone(),
                    scaled_mask,
                }),
            )
        } else {
            Tensor::from_storage(
                TensorStorage::cpu(output_data),
                input.shape().to_vec(),
                false,
            )
        }
    }

    fn parameters(&self) -> Vec<&Parameter<T>> {
        vec![]
    }

    fn parameters_mut(&mut self) -> Vec<&mut Parameter<T>> {
        vec![]
    }

    fn named_parameters(&self) -> Vec<(String, &Parameter<T>)> {
        vec![]
    }

    fn train(&mut self) {
        self.training = true;
    }

    fn eval(&mut self) {
        self.training = false;
    }

    fn is_training(&self) -> bool {
        self.training
    }
}

// ===========================================================================
// Dropout2d
// ===========================================================================

/// Randomly zeroes entire channels with probability `p` during training.
///
/// Expects input of shape `[B, C, ...]` (at least 2 dimensions). During
/// training, each channel (the entire `[H, W, ...]` slice for a given `b, c`)
/// is independently set to zero with probability `p` and surviving channels
/// are scaled by `1/(1-p)`.  During evaluation the input is returned unchanged.
///
/// # Errors
///
/// [`Dropout2d::new`] returns an error if `p` is outside `[0, 1)`.
#[derive(Debug)]
pub struct Dropout2d<T: Float> {
    p: f64,
    training: bool,
    /// In-place flag, mirroring `_DropoutNd.inplace` at
    /// `torch/nn/modules/dropout.py:29` and the `inplace` branch of
    /// `F.dropout2d` at `torch/nn/functional.py:1578-1582`
    /// (`_VF.feature_dropout_(input, p, training) if inplace`).
    inplace: bool,
    _marker: std::marker::PhantomData<T>,
}

impl<T: Float> Dropout2d<T> {
    /// Create a new `Dropout2d` layer.
    ///
    /// `p` is the probability of an entire channel being zeroed. Must be in `[0, 1)`.
    pub fn new(p: f64) -> FerrotorchResult<Self> {
        if !(0.0..1.0).contains(&p) {
            return Err(FerrotorchError::InvalidArgument {
                message: format!("dropout2d probability must be in [0, 1), got {p}"),
            });
        }
        Ok(Self {
            p,
            training: true,
            inplace: false,
            _marker: std::marker::PhantomData,
        })
    }

    /// Set the `inplace` flag, mirroring `torch.nn.Dropout2d(p, inplace=...)`.
    /// When `true`, training-mode forward mutates the input storage.
    #[must_use]
    pub fn with_inplace(mut self, inplace: bool) -> Self {
        self.inplace = inplace;
        self
    }

    /// Returns the `inplace` flag.
    pub fn inplace(&self) -> bool {
        self.inplace
    }
}

impl<T: Float> Module<T> for Dropout2d<T> {
    fn forward(&self, input: &Tensor<T>) -> FerrotorchResult<Tensor<T>> {
        // Eval mode or p == 0: identity.
        if !self.training || self.p == 0.0 {
            return Ok(input.clone());
        }

        let shape = input.shape();
        if shape.len() < 2 {
            return Err(FerrotorchError::InvalidArgument {
                message: format!(
                    "Dropout2d expects at least 2D input [B, C, ...], got shape {shape:?}"
                ),
            });
        }

        let batch = shape[0];
        let channels = shape[1];
        // Product of empty slice is 1, so no special case needed for 2-D inputs.
        let spatial: usize = shape[2..].iter().product();

        let numel = input.numel();
        let scale = T::from(1.0 / (1.0 - self.p)).unwrap();
        let zero = <T as num_traits::Zero>::zero();

        // GPU tensors are not yet supported for Dropout2d — needs a fused
        // channel-broadcast dropout kernel.
        if input.is_cuda() {
            return Err(FerrotorchError::NotImplementedOnCuda { op: "Dropout2d" });
        }

        // CPU path — draw the per-channel keep mask from the byte-exact
        // MT19937 `Generator` (`ferrotorch_core::rng`), matching torch's
        // `make_feature_noise(input).bernoulli_(1 - p)`
        // (`aten/src/ATen/native/Dropout.cpp:73-74`). torch reduces the input
        // to a `[N, C, 1, 1...]` noise tensor and draws ONE Bernoulli per
        // `[N, C]` entry in flat order, then broadcasts over the spatial dims
        // and scales survivors by `1/(1-p)` (`Dropout.cpp:81` `noise.div_(1-p)`).
        // The scalar bernoulli kernel keeps iff `next_uniform_f64() < (1 - p)`
        // (`DistributionTemplates.h` / `TransformationHelper.h:171-173`), so a
        // shared `ferrotorch_core::manual_seed(s)` reproduces
        // `torch.manual_seed(s); F.dropout2d(...)` byte-for-byte (#1635).
        let keep_prob = 1.0 - self.p;
        let channel_mask: Vec<bool> = ferrotorch_core::rng::with_thread_rng(|g| {
            (0..batch * channels)
                .map(|_| g.next_uniform_f64() < keep_prob)
                .collect()
        });

        // Expand channel mask to full element mask.
        let scaled_mask: Vec<T> = {
            let mut mask = Vec::with_capacity(numel);
            for &cm in &channel_mask {
                let val = if cm { scale } else { zero };
                for _ in 0..spatial {
                    mask.push(val);
                }
            }
            mask
        };

        let input_data = input.data_vec()?;
        let output_data: Vec<T> = input_data
            .iter()
            .zip(scaled_mask.iter())
            .map(|(&x, &m)| x * m)
            .collect();

        // In-place branch mirrors `_VF.feature_dropout_` at
        // `torch/nn/functional.py:1579`. Routed through the autograd-safe
        // policy (`apply_inplace_dropout`): errors on a grad-requiring leaf,
        // falls back to out-of-place on a grad-requiring non-leaf, mutates only
        // when no autograd node observes the storage.
        if self.inplace {
            apply_inplace_dropout(input, &output_data)?;
        }

        let result = if is_grad_enabled() && input.requires_grad() {
            Tensor::from_operation(
                TensorStorage::cpu(output_data),
                input.shape().to_vec(),
                Arc::new(Dropout2dBackward {
                    input: input.clone(),
                    scaled_mask,
                }),
            )?
        } else {
            Tensor::from_storage(
                TensorStorage::cpu(output_data),
                input.shape().to_vec(),
                false,
            )?
        };
        Ok(result)
    }

    fn parameters(&self) -> Vec<&Parameter<T>> {
        vec![]
    }

    fn parameters_mut(&mut self) -> Vec<&mut Parameter<T>> {
        vec![]
    }

    fn named_parameters(&self) -> Vec<(String, &Parameter<T>)> {
        vec![]
    }

    fn train(&mut self) {
        self.training = true;
    }

    fn eval(&mut self) {
        self.training = false;
    }

    fn is_training(&self) -> bool {
        self.training
    }
}

// ===========================================================================
// Dropout1d — CL-433
// ===========================================================================

/// Randomly zeroes entire 1D channels with probability `p` during training.
///
/// Expects input of shape `[B, C, L]` (3 dimensions). During training,
/// each channel (the entire length-`L` slice for a given `b, c`) is
/// independently set to zero with probability `p` and surviving channels
/// are scaled by `1/(1-p)`. During evaluation the input is returned unchanged.
///
/// This is the 1D analogue of [`Dropout2d`].
///
/// Matches `torch.nn.Dropout1d`.
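///
/// The channel-broadcast step described above can be sketched in isolation.
/// This is a minimal standalone illustration, not this crate's API:
/// `expand_channel_mask` is a hypothetical helper, and the keep decisions are
/// passed in directly instead of being drawn from the MT19937 `Generator`.
///
/// ```rust
/// // Hypothetical helper: expand one keep decision per `(b, c)` channel into
/// // a full-length mask, scaling kept channels by `1 / (1 - p)`.
/// fn expand_channel_mask(keep: &[bool], length: usize, p: f64) -> Vec<f64> {
///     let scale = 1.0 / (1.0 - p);
///     keep.iter()
///         .flat_map(|&k| std::iter::repeat(if k { scale } else { 0.0 }).take(length))
///         .collect()
/// }
///
/// fn main() {
///     // Input of shape [B = 1, C = 2, L = 3]: channel 0 kept, channel 1 dropped.
///     let input = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
///     let mask = expand_channel_mask(&[true, false], 3, 0.5);
///     let output: Vec<f64> = input.iter().zip(&mask).map(|(x, m)| x * m).collect();
///     assert_eq!(output, vec![2.0, 4.0, 6.0, 0.0, 0.0, 0.0]);
/// }
/// ```
///
/// With `p = 0.5` the scale is exactly `2.0`, so the kept channel doubles and
/// the dropped channel is zeroed as a unit.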
#[derive(Debug)]
pub struct Dropout1d<T: Float> {
    p: f64,
    training: bool,
    /// In-place flag, mirroring `_DropoutNd.inplace` at
    /// `torch/nn/modules/dropout.py:29` and the `inplace` branch of
    /// `F.dropout1d` at `torch/nn/functional.py:1515-1519`
    /// (`_VF.feature_dropout_(input, p, training) if inplace`).
    inplace: bool,
    _marker: std::marker::PhantomData<T>,
}

impl<T: Float> Dropout1d<T> {
    /// Create a new `Dropout1d` layer.
    ///
    /// `p` is the probability of an entire channel being zeroed. Must be in `[0, 1)`.
    pub fn new(p: f64) -> FerrotorchResult<Self> {
        if !(0.0..1.0).contains(&p) {
            return Err(FerrotorchError::InvalidArgument {
                message: format!("dropout1d probability must be in [0, 1), got {p}"),
            });
        }
        Ok(Self {
            p,
            training: true,
            inplace: false,
            _marker: std::marker::PhantomData,
        })
    }

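    /// Set the `inplace` flag, mirroring `torch.nn.Dropout1d(p, inplace=...)`.
    /// When `true`, training-mode forward mutates the input storage only if no
    /// autograd node observes it: it errors on a leaf that requires grad and
    /// falls back to out-of-place on a non-leaf that requires grad.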
    #[must_use]
    pub fn with_inplace(mut self, inplace: bool) -> Self {
        self.inplace = inplace;
        self
    }

    /// Returns the `inplace` flag.
    pub fn inplace(&self) -> bool {
        self.inplace
    }
}

impl<T: Float> Module<T> for Dropout1d<T> {
    fn forward(&self, input: &Tensor<T>) -> FerrotorchResult<Tensor<T>> {
        if !self.training || self.p == 0.0 {
            return Ok(input.clone());
        }

        let shape = input.shape();
        if shape.len() != 3 {
            return Err(FerrotorchError::InvalidArgument {
                message: format!(
                    "Dropout1d expects 3D input [B, C, L], got shape {:?}",
                    shape
                ),
            });
        }

        let batch = shape[0];
        let channels = shape[1];
        let length = shape[2];

        let numel = input.numel();
        let scale = T::from(1.0 / (1.0 - self.p)).unwrap();
        let zero = <T as num_traits::Zero>::zero();

        if input.is_cuda() {
            return Err(FerrotorchError::NotImplementedOnCuda { op: "Dropout1d" });
        }

        // Per-channel keep mask from the byte-exact MT19937 `Generator`,
        // matching torch's `make_feature_noise(input).bernoulli_(1 - p)`
        // (`aten/src/ATen/native/Dropout.cpp:73-74`): one Bernoulli draw per
        // `[N, C]` channel in flat order, keep iff `next_uniform_f64() < (1-p)`,
        // broadcast over the length-`L` dim, survivors scaled by `1/(1-p)`.
        // Reproducible under `ferrotorch_core::manual_seed` (#1635).
        let keep_prob = 1.0 - self.p;
        let channel_mask: Vec<bool> = ferrotorch_core::rng::with_thread_rng(|g| {
            (0..batch * channels)
                .map(|_| g.next_uniform_f64() < keep_prob)
                .collect()
        });

        let scaled_mask: Vec<T> = {
            let mut mask = Vec::with_capacity(numel);
            for &cm in &channel_mask {
                let val = if cm { scale } else { zero };
                for _ in 0..length {
                    mask.push(val);
                }
            }
            mask
        };

        let input_data = input.data_vec()?;
        let output_data: Vec<T> = input_data
            .iter()
            .zip(scaled_mask.iter())
            .map(|(&x, &m)| x * m)
            .collect();

        // In-place branch mirrors `_VF.feature_dropout_` at
        // `torch/nn/functional.py:1516`. Routed through the autograd-safe
        // policy (`apply_inplace_dropout`): errors on a grad-requiring leaf,
        // falls back to out-of-place on a grad-requiring non-leaf, mutates only
        // when no autograd node observes the storage.
        if self.inplace {
            apply_inplace_dropout(input, &output_data)?;
        }

        let result = if is_grad_enabled() && input.requires_grad() {
            Tensor::from_operation(
                TensorStorage::cpu(output_data),
                input.shape().to_vec(),
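                // Reuses `Dropout2dBackward`, which carries the input and a
                // full-shape `scaled_mask` rather than anything rank-specific.
                Arc::new(Dropout2dBackward {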
                    input: input.clone(),
                    scaled_mask,
                }),
            )?
        } else {
            Tensor::from_storage(
                TensorStorage::cpu(output_data),
                input.shape().to_vec(),
                false,
            )?
        };
        Ok(result)
    }

    fn parameters(&self) -> Vec<&Parameter<T>> {
        vec![]
    }

    fn parameters_mut(&mut self) -> Vec<&mut Parameter<T>> {
        vec![]
    }

    fn named_parameters(&self) -> Vec<(String, &Parameter<T>)> {
        vec![]
    }

    fn train(&mut self) {
        self.training = true;
    }

    fn eval(&mut self) {
        self.training = false;
    }

    fn is_training(&self) -> bool {
        self.training
    }
}

// ===========================================================================
// Dropout3d — CL-433
// ===========================================================================

/// Randomly zeroes entire 3D channels with probability `p` during training.
///
/// Expects input of shape `[B, C, D, H, W]` (5 dimensions). During training,
/// each channel (the entire `D * H * W` volume for a given `b, c`) is
/// independently set to zero with probability `p` and surviving channels
/// are scaled by `1/(1-p)`. During evaluation the input is returned unchanged.
///
/// Matches `torch.nn.Dropout3d`.
#[derive(Debug)]
pub struct Dropout3d<T: Float> {
    p: f64,
    training: bool,
    /// In-place flag, mirroring `_DropoutNd.inplace` at
    /// `torch/nn/modules/dropout.py:29` and the `inplace` branch of
    /// `F.dropout3d` at `torch/nn/functional.py:1628-1632`
    /// (`_VF.feature_dropout_(input, p, training) if inplace`).
    inplace: bool,
    _marker: std::marker::PhantomData<T>,
}

impl<T: Float> Dropout3d<T> {
    /// Create a new `Dropout3d` layer.
    ///
    /// `p` is the probability of an entire channel being zeroed. Must be in `[0, 1)`.
    pub fn new(p: f64) -> FerrotorchResult<Self> {
        if !(0.0..1.0).contains(&p) {
            return Err(FerrotorchError::InvalidArgument {
                message: format!("dropout3d probability must be in [0, 1), got {p}"),
            });
        }
        Ok(Self {
            p,
            training: true,
            inplace: false,
            _marker: std::marker::PhantomData,
        })
    }

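    /// Set the `inplace` flag, mirroring `torch.nn.Dropout3d(p, inplace=...)`.
    /// When `true`, training-mode forward mutates the input storage only if no
    /// autograd node observes it: it errors on a leaf that requires grad and
    /// falls back to out-of-place on a non-leaf that requires grad.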
    #[must_use]
    pub fn with_inplace(mut self, inplace: bool) -> Self {
        self.inplace = inplace;
        self
    }

    /// Returns the `inplace` flag.
    pub fn inplace(&self) -> bool {
        self.inplace
    }
}

impl<T: Float> Module<T> for Dropout3d<T> {
    fn forward(&self, input: &Tensor<T>) -> FerrotorchResult<Tensor<T>> {
        if !self.training || self.p == 0.0 {
            return Ok(input.clone());
        }

        let shape = input.shape();
        if shape.len() != 5 {
            return Err(FerrotorchError::InvalidArgument {
                message: format!(
                    "Dropout3d expects 5D input [B, C, D, H, W], got shape {:?}",
                    shape
                ),
            });
        }

        let batch = shape[0];
        let channels = shape[1];
        let spatial: usize = shape[2..].iter().product();

        let numel = input.numel();
        let scale = T::from(1.0 / (1.0 - self.p)).unwrap();
        let zero = <T as num_traits::Zero>::zero();

        if input.is_cuda() {
            return Err(FerrotorchError::NotImplementedOnCuda { op: "Dropout3d" });
        }

        // Per-channel keep mask from the byte-exact MT19937 `Generator`,
        // matching torch's `make_feature_noise(input).bernoulli_(1 - p)`
        // (`aten/src/ATen/native/Dropout.cpp:73-74`): one Bernoulli draw per
        // `[N, C]` channel in flat order, keep iff `next_uniform_f64() < (1-p)`,
        // broadcast over the `D*H*W` volume, survivors scaled by `1/(1-p)`.
        // Reproducible under `ferrotorch_core::manual_seed` (#1635).
        let keep_prob = 1.0 - self.p;
        let channel_mask: Vec<bool> = ferrotorch_core::rng::with_thread_rng(|g| {
            (0..batch * channels)
                .map(|_| g.next_uniform_f64() < keep_prob)
                .collect()
        });

        let scaled_mask: Vec<T> = {
            let mut mask = Vec::with_capacity(numel);
            for &cm in &channel_mask {
                let val = if cm { scale } else { zero };
                for _ in 0..spatial {
                    mask.push(val);
                }
            }
            mask
        };

        let input_data = input.data_vec()?;
        let output_data: Vec<T> = input_data
            .iter()
            .zip(scaled_mask.iter())
            .map(|(&x, &m)| x * m)
            .collect();

        // In-place branch mirrors `_VF.feature_dropout_` at
        // `torch/nn/functional.py:1629`. Routed through the autograd-safe
        // policy (`apply_inplace_dropout`): errors on a grad-requiring leaf,
        // falls back to out-of-place on a grad-requiring non-leaf, mutates only
        // when no autograd node observes the storage.
        if self.inplace {
            apply_inplace_dropout(input, &output_data)?;
        }

        let result = if is_grad_enabled() && input.requires_grad() {
            Tensor::from_operation(
                TensorStorage::cpu(output_data),
                input.shape().to_vec(),
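                // Reuses `Dropout2dBackward`, which carries the input and a
                // full-shape `scaled_mask` rather than anything rank-specific.
                Arc::new(Dropout2dBackward {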
                    input: input.clone(),
                    scaled_mask,
                }),
            )?
        } else {
            Tensor::from_storage(
                TensorStorage::cpu(output_data),
                input.shape().to_vec(),
                false,
            )?
        };
        Ok(result)
    }

    fn parameters(&self) -> Vec<&Parameter<T>> {
        vec![]
    }

    fn parameters_mut(&mut self) -> Vec<&mut Parameter<T>> {
        vec![]
    }

    fn named_parameters(&self) -> Vec<(String, &Parameter<T>)> {
        vec![]
    }

    fn train(&mut self) {
        self.training = true;
    }

    fn eval(&mut self) {
        self.training = false;
    }

    fn is_training(&self) -> bool {
        self.training
    }
}

// ===========================================================================
// AlphaDropout — CL-433
// ===========================================================================

/// Alpha Dropout for use with SELU activations.
///
/// Unlike standard dropout, `AlphaDropout` preserves the self-normalizing
/// property of SELU by maintaining the mean and variance of the input.
/// Dropped elements are set to the SELU saturation value rather than zero,
/// and the output is affinely transformed to restore the original mean and
/// variance.
///
/// During training, mirroring `aten/src/ATen/native/Dropout.cpp:74-79`:
/// 1. A per-element Bernoulli keep-mask is drawn at probability `1 - p` from
///    the byte-exact MT19937 `Generator` (keep iff `next_uniform_f64() < 1-p`).
/// 2. With `alpha = 1.7580993408473766` and
///    `a = 1/sqrt((alpha^2 * p + 1) * (1 - p))`:
///    - kept elements map to `a*x + alpha*a*p`,
///    - dropped elements map to the constant `-alpha*a + alpha*a*p`.
///
/// During evaluation, the input is returned unchanged.
///
/// Matches `torch.nn.AlphaDropout`. Reproducible under
/// `ferrotorch_core::manual_seed` (#1636).
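///
/// As a hedged check of the mean/variance claim above, the standalone sketch
/// below evaluates both affine branches in closed form for a zero-mean,
/// unit-variance input. `alpha_affine` and `output_moments` are hypothetical
/// helpers written for this illustration, not part of this crate.
///
/// ```rust
/// // torch's hardcoded alpha-dropout constant, copied verbatim.
/// const ALPHA: f64 = 1.7580993408473766;
///
/// // Returns `(a, kept_bias, dropped_value)` for drop probability `p` in `[0, 1)`.
/// fn alpha_affine(p: f64) -> (f64, f64, f64) {
///     let a = 1.0 / ((ALPHA * ALPHA * p + 1.0) * (1.0 - p)).sqrt();
///     (a, ALPHA * a * p, -ALPHA * a + ALPHA * a * p)
/// }
///
/// // Mean and second moment of the output when the input has mean 0 and
/// // variance 1: kept elements are `a*x + b`, dropped elements are `d`.
/// fn output_moments(p: f64) -> (f64, f64) {
///     let (a, b, d) = alpha_affine(p);
///     let mean = (1.0 - p) * b + p * d;
///     let second = (1.0 - p) * (a * a + b * b) + p * d * d;
///     (mean, second)
/// }
///
/// fn main() {
///     for p in [0.1, 0.25, 0.5, 0.9] {
///         let (mean, second) = output_moments(p);
///         assert!(mean.abs() < 1e-12);
///         assert!((second - 1.0).abs() < 1e-12);
///     }
/// }
/// ```
///
/// The mean stays at 0 and the second moment at 1 for every `p`, which is the
/// property the choice of `a` is designed to give.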
#[derive(Debug)]
pub struct AlphaDropout<T: Float> {
    p: f64,
    training: bool,
    /// In-place flag, carried for API parity with `_DropoutNd.inplace`
    /// (`torch/nn/modules/dropout.py:29`).
    ///
    /// NOTE — faithful upstream behaviour: `AlphaDropout.forward` at
    /// `torch/nn/modules/dropout.py:265-269` calls
    /// `F.alpha_dropout(input, self.p, self.training)` and does **not** pass
    /// `self.inplace`, so torch's `nn.AlphaDropout(p, inplace=True)` does NOT
    /// mutate in place at the module level — the `inplace` field exists on the
    /// struct (inherited from `_DropoutNd.__init__`) but the module forward
    /// drops it. We mirror that exactly: the field is stored for API parity,
    /// but [`AlphaDropout::forward`] never mutates the input. (The functional
    /// `F.alpha_dropout` does accept `inplace`, but the module never forwards
    /// it.)
    inplace: bool,
    _marker: std::marker::PhantomData<T>,
}

/// The alpha-dropout affine constant torch hardcodes at
/// `aten/src/ATen/native/Dropout.cpp:76`
/// (`constexpr double alpha = 1.7580993408473766;`). This is the SELU-derived
/// `lambda * alpha` magnitude, but used VERBATIM as torch's literal — NOT
/// recomputed as `SELU_LAMBDA * SELU_ALPHA`, which differs in the last ULPs and
/// would shift the affine away from torch byte-for-byte (#1636).
const ALPHA_DROPOUT_ALPHA: f64 = 1.7580993408473766;

impl<T: Float> AlphaDropout<T> {
    /// Create a new `AlphaDropout` layer.
    ///
    /// `p` is the probability of an element being dropped. Must be in `[0, 1)`.
    pub fn new(p: f64) -> FerrotorchResult<Self> {
        if !(0.0..1.0).contains(&p) {
            return Err(FerrotorchError::InvalidArgument {
                message: format!("alpha_dropout probability must be in [0, 1), got {p}"),
            });
        }
        Ok(Self {
            p,
            training: true,
            inplace: false,
            _marker: std::marker::PhantomData,
        })
    }

    /// Set the `inplace` flag for API parity with
    /// `torch.nn.AlphaDropout(p, inplace=...)`.
    ///
    /// Like upstream, the module `forward` does NOT mutate in place even when
    /// this is `true` — `torch.nn.AlphaDropout.forward` never forwards
    /// `self.inplace` to `F.alpha_dropout` (`dropout.py:265-269`). The flag is
    /// retained so the constructor surface matches torch field-for-field.
    #[must_use]
    pub fn with_inplace(mut self, inplace: bool) -> Self {
        self.inplace = inplace;
        self
    }

    /// Returns the `inplace` flag.
    pub fn inplace(&self) -> bool {
        self.inplace
    }
}

/// Backward node for AlphaDropout.
///
/// The affine correction factor `a` is baked into `grad_mask`:
/// surviving elements get `a`, dropped elements get `0`.
/// Gradient routing: `grad_input = grad_output * grad_mask`.
#[derive(Debug)]
struct AlphaDropoutBackward<T: Float> {
    input: Tensor<T>,
    /// Mask with `a` for kept elements and `0` for dropped elements.
    grad_mask: Vec<T>,
}

impl<T: Float> GradFn<T> for AlphaDropoutBackward<T> {
    fn backward(&self, grad_output: &Tensor<T>) -> FerrotorchResult<Vec<Option<Tensor<T>>>> {
        if grad_output.is_cuda() {
            return Err(FerrotorchError::NotImplementedOnCuda {
                op: "AlphaDropout backward",
            });
        }
        let da = if self.input.requires_grad() {
            let go_data = grad_output.data_vec()?;
            let grad_a: Vec<T> = go_data
                .iter()
                .zip(self.grad_mask.iter())
                .map(|(&g, &m)| g * m)
                .collect();
            let g = Tensor::from_storage(
                TensorStorage::cpu(grad_a),
                self.input.shape().to_vec(),
                false,
            )?;
            Some(g)
        } else {
            None
        };
        Ok(vec![da])
    }

    fn inputs(&self) -> Vec<&Tensor<T>> {
        vec![&self.input]
    }

    fn name(&self) -> &'static str {
        "AlphaDropoutBackward"
    }
}

impl<T: Float> Module<T> for AlphaDropout<T> {
    fn forward(&self, input: &Tensor<T>) -> FerrotorchResult<Tensor<T>> {
        if !self.training || self.p == 0.0 {
            return Ok(input.clone());
        }

        if input.is_cuda() {
            return Err(FerrotorchError::NotImplementedOnCuda { op: "AlphaDropout" });
        }

        let numel = input.numel();
        let p = self.p;

        // torch's EXACT alpha affine, `aten/src/ATen/native/Dropout.cpp:74-79`:
        //   noise.bernoulli_(1 - p)                 // 1.0 kept, 0.0 dropped
        //   constexpr double alpha = 1.7580993408473766;
        //   double a = 1. / sqrt((alpha*alpha*p + 1) * (1 - p));
        //   b = noise.add(-1).mul_(alpha*a).add_(alpha*a*p);
        //   noise.mul_(a);                          // a kept, 0 dropped
        //   out = input * noise + b
        // Folding the per-element `b = (noise-1)*alpha*a + alpha*a*p`:
        //   kept  (noise=1): out = a*x + alpha*a*p
        //   dropped(noise=0): out = -alpha*a + alpha*a*p   (constant in x)
        // We use torch's hardcoded `alpha` constant verbatim, NOT a recomputed
        // `SELU_LAMBDA * SELU_ALPHA` (same magnitude, 1.7580993..., but the
        // recomputed value diverges in the last ULPs and changes the affine).
        let alpha = ALPHA_DROPOUT_ALPHA;
        let a_f64 = 1.0 / ((alpha * alpha * p + 1.0) * (1.0 - p)).sqrt();
        let dropped_f64 = -alpha * a_f64 + alpha * a_f64 * p;
        let kept_b_f64 = alpha * a_f64 * p;

        let a = T::from(a_f64).unwrap();
        let kept_b = T::from(kept_b_f64).unwrap();
        let dropped_v = T::from(dropped_f64).unwrap();
        let zero = <T as num_traits::Zero>::zero();

        // Per-element keep mask from the byte-exact MT19937 `Generator`,
        // matching `at::empty_like(input).bernoulli_(1 - p)` (alpha_dropout is
        // element-wise, NOT feature noise; `Dropout.cpp:73`). Keep iff
        // `next_uniform_f64() < (1 - p)`; reproducible under
        // `ferrotorch_core::manual_seed` (#1636).
        let keep_prob = 1.0 - p;
        let keep: Vec<bool> = ferrotorch_core::rng::with_thread_rng(|g| {
            (0..numel)
                .map(|_| g.next_uniform_f64() < keep_prob)
                .collect()
        });

        let input_data = input.data()?;
        let mut output_data = Vec::with_capacity(numel);
        let mut grad_mask = Vec::with_capacity(numel);

        for (i, &x) in input_data.iter().enumerate() {
            if keep[i] {
                // Kept element: a * x + alpha*a*p
                output_data.push(a * x + kept_b);
                grad_mask.push(a);
            } else {
                // Dropped element: -alpha*a + alpha*a*p (independent of x).
                output_data.push(dropped_v);
                grad_mask.push(zero);
            }
        }

        if is_grad_enabled() && input.requires_grad() {
            Tensor::from_operation(
                TensorStorage::cpu(output_data),
                input.shape().to_vec(),
                Arc::new(AlphaDropoutBackward {
                    input: input.clone(),
                    grad_mask,
                }),
            )
        } else {
            Tensor::from_storage(
                TensorStorage::cpu(output_data),
                input.shape().to_vec(),
                false,
            )
        }
    }

    fn parameters(&self) -> Vec<&Parameter<T>> {
        vec![]
    }

    fn parameters_mut(&mut self) -> Vec<&mut Parameter<T>> {
        vec![]
    }

    fn named_parameters(&self) -> Vec<(String, &Parameter<T>)> {
        vec![]
    }

    fn train(&mut self) {
        self.training = true;
    }

    fn eval(&mut self) {
        self.training = false;
    }

    fn is_training(&self) -> bool {
        self.training
    }
}

// ===========================================================================
// FeatureAlphaDropout — closes #1448
// ===========================================================================

/// Randomly masks entire feature-channels with the SELU saturation value
/// during training, mirroring `torch.nn.FeatureAlphaDropout`
/// (`torch/nn/modules/dropout.py:233-281`).
///
/// Unlike [`AlphaDropout`], which drops individual elements, this layer
/// drops every spatial position within a `(b, c)` feature-channel as a unit
/// — the dropout decision is sampled once per channel and broadcast over
/// the trailing spatial dims. Used in self-normalising convolutional
/// networks where per-feature decorrelation must be preserved while
/// maintaining mean/variance.
///
/// During training, mirroring `aten/src/ATen/native/Dropout.cpp:73-79`
/// (`_dropout_impl<feature=true, alpha=true>`):
/// 1. A per-channel Bernoulli keep-mask is drawn over the reduced
///    `[N, C, 1, 1...]` noise tensor at probability `1 - p` from the
///    byte-exact MT19937 `Generator` (keep iff `next_uniform_f64() < 1-p`),
///    in flat `[N, C]` order, then broadcast over the spatial volume.
/// 2. With `alpha = 1.7580993408473766` and
///    `a = 1/sqrt((alpha^2 * p + 1) * (1 - p))`, kept channels map to
///    `a*x + alpha*a*p` and dropped channels to `-alpha*a + alpha*a*p`.
///
/// During evaluation, the input is returned unchanged.
///
/// Expects input of shape `[N, C, *]` (at least 2-D). Reproducible under
/// `ferrotorch_core::manual_seed` (#1636).
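///
/// The per-channel draw in step 1 can be sketched on its own. This is a rough
/// standalone illustration under stated assumptions: `SplitMix64` is a
/// stand-in generator, NOT the MT19937 `Generator` this crate uses, so the
/// masks it produces will not match torch. `draw_channel_mask` is a
/// hypothetical helper. Only the draw order and the keep rule are the point.
///
/// ```rust
/// // Stand-in generator (SplitMix64), used here only for determinism.
/// struct SplitMix64(u64);
///
/// impl SplitMix64 {
///     fn next_u64(&mut self) -> u64 {
///         self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
///         let mut z = self.0;
///         z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
///         z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
///         z ^ (z >> 31)
///     }
///
///     // Uniform f64 in [0, 1) built from the top 53 bits.
///     fn next_uniform_f64(&mut self) -> f64 {
///         (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
///     }
/// }
///
/// // One draw per `(n, c)` entry in flat order; keep iff `u < 1 - p`.
/// fn draw_channel_mask(seed: u64, batch: usize, channels: usize, p: f64) -> Vec<bool> {
///     let mut rng = SplitMix64(seed);
///     let keep_prob = 1.0 - p;
///     (0..batch * channels)
///         .map(|_| rng.next_uniform_f64() < keep_prob)
///         .collect()
/// }
///
/// fn main() {
///     let first = draw_channel_mask(42, 2, 3, 0.5);
///     let second = draw_channel_mask(42, 2, 3, 0.5);
///     assert_eq!(first.len(), 6); // one decision per (n, c), not per element
///     assert_eq!(first, second); // same seed, same mask
///     assert!(draw_channel_mask(7, 2, 3, 0.0).iter().all(|&k| k)); // p = 0 keeps all
/// }
/// ```
///
/// The number of draws depends only on `N * C`, so the spatial size of the
/// input does not change how much of the random stream is consumed.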
#[derive(Debug)]
pub struct FeatureAlphaDropout<T: Float> {
    p: f64,
    training: bool,
    /// In-place flag, carried for API parity with `_DropoutNd.inplace`
    /// (`torch/nn/modules/dropout.py:29`).
    ///
    /// NOTE — faithful upstream behaviour: `FeatureAlphaDropout.forward` at
    /// `torch/nn/modules/dropout.py:319-323` calls
    /// `F.feature_alpha_dropout(input, self.p, self.training)` and does **not**
    /// pass `self.inplace`, so torch's `nn.FeatureAlphaDropout(p,
    /// inplace=True)` does NOT mutate in place at the module level. We mirror
    /// that exactly: the field is stored for API parity, but
    /// [`FeatureAlphaDropout::forward`] never mutates the input.
    inplace: bool,
    _marker: std::marker::PhantomData<T>,
}

impl<T: Float> FeatureAlphaDropout<T> {
    /// Create a new `FeatureAlphaDropout` layer.
    ///
    /// `p` is the probability of an entire feature-channel being dropped.
    /// Must be in `[0, 1)`.
    pub fn new(p: f64) -> FerrotorchResult<Self> {
        if !(0.0..1.0).contains(&p) {
            return Err(FerrotorchError::InvalidArgument {
                message: format!("feature_alpha_dropout probability must be in [0, 1), got {p}"),
            });
        }
        Ok(Self {
            p,
            training: true,
            inplace: false,
            _marker: std::marker::PhantomData,
        })
    }

    /// Set the `inplace` flag for API parity with
    /// `torch.nn.FeatureAlphaDropout(p, inplace=...)`.
    ///
    /// Like upstream, the module `forward` does NOT mutate in place even when
    /// this is `true` — `torch.nn.FeatureAlphaDropout.forward` never forwards
    /// `self.inplace` to `F.feature_alpha_dropout` (`dropout.py:319-323`).
    #[must_use]
    pub fn with_inplace(mut self, inplace: bool) -> Self {
        self.inplace = inplace;
        self
    }

    /// Returns the `inplace` flag.
    pub fn inplace(&self) -> bool {
        self.inplace
    }
}

/// Backward node for `FeatureAlphaDropout`.
///
/// The affine factor `a` is baked into the broadcast mask: kept channels
/// receive `a`, dropped channels receive `0`. Gradient routes as
/// `grad_input = grad_output * grad_mask`.
#[derive(Debug)]
struct FeatureAlphaDropoutBackward<T: Float> {
    input: Tensor<T>,
    /// Full-shape mask with `a` for kept channels, `0` for dropped.
    grad_mask: Vec<T>,
}

impl<T: Float> GradFn<T> for FeatureAlphaDropoutBackward<T> {
    fn backward(&self, grad_output: &Tensor<T>) -> FerrotorchResult<Vec<Option<Tensor<T>>>> {
        if grad_output.is_cuda() {
            return Err(FerrotorchError::NotImplementedOnCuda {
                op: "FeatureAlphaDropout backward",
            });
        }
        let da = if self.input.requires_grad() {
            let go_data = grad_output.data_vec()?;
            let grad_a: Vec<T> = go_data
                .iter()
                .zip(self.grad_mask.iter())
                .map(|(&g, &m)| g * m)
                .collect();
            let g = Tensor::from_storage(
                TensorStorage::cpu(grad_a),
                self.input.shape().to_vec(),
                false,
            )?;
            Some(g)
        } else {
            None
        };
        Ok(vec![da])
    }

    fn inputs(&self) -> Vec<&Tensor<T>> {
        vec![&self.input]
    }

    fn name(&self) -> &'static str {
        "FeatureAlphaDropoutBackward"
    }
}

impl<T: Float> Module<T> for FeatureAlphaDropout<T> {
    fn forward(&self, input: &Tensor<T>) -> FerrotorchResult<Tensor<T>> {
        if !self.training || self.p == 0.0 {
            return Ok(input.clone());
        }

        let shape = input.shape();
        if shape.len() < 2 {
            return Err(FerrotorchError::InvalidArgument {
                message: format!(
                    "FeatureAlphaDropout expects at least 2D input [N, C, ...], got shape {:?}",
                    shape
                ),
            });
        }

        if input.is_cuda() {
            return Err(FerrotorchError::NotImplementedOnCuda {
                op: "FeatureAlphaDropout",
            });
        }

        let batch = shape[0];
        let channels = shape[1];
        // Spatial dims (D, H, W, ...). For a 2-D `[N, C]` input the product
        // of the empty suffix is 1, matching torch's broadcast behaviour.
        let spatial: usize = shape[2..].iter().product();

        let numel = input.numel();
        let p = self.p;

        // torch's EXACT feature-alpha affine: `feature_alpha_dropout` calls
        // `_dropout_impl<feature=true, alpha=true>`, so the noise is a
        // PER-CHANNEL `make_feature_noise` tensor (`Dropout.cpp:73`) drawn with
        // `bernoulli_(1 - p)`, then the alpha affine
        // (`Dropout.cpp:76-79`): `alpha = 1.7580993408473766`,
        // `a = 1/sqrt((alpha^2*p + 1)*(1-p))`,
        // kept (noise=1) -> `a*x + alpha*a*p`,
        // dropped (noise=0) -> `-alpha*a + alpha*a*p` (constant in x).
        let alpha = ALPHA_DROPOUT_ALPHA;
        let a_f64 = 1.0 / ((alpha * alpha * p + 1.0) * (1.0 - p)).sqrt();
        let dropped_f64 = -alpha * a_f64 + alpha * a_f64 * p;
        let kept_b_f64 = alpha * a_f64 * p;

        let a = T::from(a_f64).unwrap();
        let kept_b = T::from(kept_b_f64).unwrap();
        let dropped_v = T::from(dropped_f64).unwrap();
        let zero = <T as num_traits::Zero>::zero();

        // Per-channel keep mask: one Bernoulli draw per `[N, C]` entry in flat
        // order from the byte-exact MT19937 `Generator`, keep iff
        // `next_uniform_f64() < (1 - p)`, broadcast over the trailing spatial
        // volume. Reproducible under `ferrotorch_core::manual_seed` (#1636).
        let keep_prob = 1.0 - p;
        let keep_channel: Vec<bool> = ferrotorch_core::rng::with_thread_rng(|g| {
            (0..batch * channels)
                .map(|_| g.next_uniform_f64() < keep_prob)
                .collect()
        });

        let input_data = input.data_vec()?;
        let mut output_data = Vec::with_capacity(numel);
        let mut grad_mask = Vec::with_capacity(numel);

        // For each channel: emit `spatial` masked elements at once.
        for (bc, &keep) in keep_channel.iter().enumerate() {
            let base = bc * spatial;
            for &x in &input_data[base..base + spatial] {
                if keep {
                    output_data.push(a * x + kept_b);
                    grad_mask.push(a);
                } else {
                    output_data.push(dropped_v);
                    grad_mask.push(zero);
                }
            }
        }

        if is_grad_enabled() && input.requires_grad() {
            Tensor::from_operation(
                TensorStorage::cpu(output_data),
                input.shape().to_vec(),
                Arc::new(FeatureAlphaDropoutBackward {
                    input: input.clone(),
                    grad_mask,
                }),
            )
        } else {
            Tensor::from_storage(
                TensorStorage::cpu(output_data),
                input.shape().to_vec(),
                false,
            )
        }
    }

    fn parameters(&self) -> Vec<&Parameter<T>> {
        vec![]
    }

    fn parameters_mut(&mut self) -> Vec<&mut Parameter<T>> {
        vec![]
    }

    fn named_parameters(&self) -> Vec<(String, &Parameter<T>)> {
        vec![]
    }

    fn train(&mut self) {
        self.training = true;
    }

    fn eval(&mut self) {
        self.training = false;
    }

    fn is_training(&self) -> bool {
        self.training
    }
}

// ===========================================================================
// Tests
// ===========================================================================

#[cfg(test)]
mod tests {
    use super::*;

    /// Create a leaf tensor with given data and shape.
    fn leaf_tensor(data: &[f32], shape: &[usize], requires_grad: bool) -> Tensor<f32> {
        Tensor::from_storage(
            TensorStorage::cpu(data.to_vec()),
            shape.to_vec(),
            requires_grad,
        )
        .unwrap()
    }

    // -----------------------------------------------------------------------
    // Dropout
    // -----------------------------------------------------------------------

    #[test]
    fn test_dropout_rate_approximately_correct() {
        let d = Dropout::<f32>::new(0.5).unwrap();
        let input = ferrotorch_core::ones::<f32>(&[100_000]).unwrap();
        let output = d.forward(&input).unwrap();
        let data = output.data().unwrap();

        // Count zeros — should be roughly 50%.
        let zeros = data.iter().filter(|&&x| x == 0.0).count();
        let rate = zeros as f64 / data.len() as f64;
        assert!(
            (rate - 0.5).abs() < 0.05,
            "dropout rate = {rate}, expected ~0.5"
        );

        // Surviving elements should be scaled by 1/(1-0.5) = 2.0.
        let non_zero: Vec<f32> = data.iter().copied().filter(|&x| x != 0.0).collect();
        assert!(!non_zero.is_empty());
        for &v in &non_zero {
            assert!(
                (v - 2.0).abs() < 1e-6,
                "surviving element = {v}, expected 2.0"
            );
        }
    }

    #[test]
    fn test_dropout_eval_is_identity() {
        let mut d = Dropout::<f32>::new(0.5).unwrap();
        d.eval();
        assert!(!d.is_training());

        let input = ferrotorch_core::ones::<f32>(&[100]).unwrap();
        let output = d.forward(&input).unwrap();

        // In eval mode the output should be the exact same Arc (identity).
        assert!(output.is_same(&input));
    }

    #[test]
    fn test_dropout_zero_prob_is_identity() {
        let d = Dropout::<f32>::new(0.0).unwrap();
        let input = ferrotorch_core::ones::<f32>(&[100]).unwrap();
        let output = d.forward(&input).unwrap();
        assert!(output.is_same(&input));
    }

    #[test]
    fn test_dropout_invalid_p() {
        assert!(Dropout::<f32>::new(1.0).is_err());
        assert!(Dropout::<f32>::new(-0.1).is_err());
        assert!(Dropout::<f32>::new(1.5).is_err());
    }

    #[test]
    fn test_dropout_backward_routes_through_surviving() {
        let d = Dropout::<f32>::new(0.5).unwrap();
        let input = leaf_tensor(&[1.0; 1000], &[1000], true);
        let output = d.forward(&input).unwrap();

        // Calling backward requires a scalar loss, so sum the output manually.
        let out_data = output.data().unwrap().to_vec();
        let total: f32 = out_data.iter().sum();

        // Wrap the sum in the module-level `SumBackward` node so we can call
        // backward.
        let loss = Tensor::from_operation(
            TensorStorage::cpu(vec![total]),
            vec![],
            Arc::new(SumBackward {
                input: output.clone(),
            }),
        )
        .unwrap();
        loss.backward().unwrap();

        let grad = input.grad().unwrap().unwrap();
        let grad_data = grad.data().unwrap();

        // Every gradient element should be either 0 (dropped) or 1/(1-p) = 2.0 (survived).
        for &g in grad_data {
            assert!(
                g == 0.0 || (g - 2.0).abs() < 1e-6,
                "gradient element = {g}, expected 0.0 or 2.0"
            );
        }

        // The dropout mask for forward and backward should match: output zero
        // iff gradient zero.
        let out_data = output.data().unwrap();
        for (i, (&o, &g)) in out_data.iter().zip(grad_data.iter()).enumerate() {
            assert_eq!(
                o == 0.0,
                g == 0.0,
                "mismatch at index {i}: output={o}, grad={g}"
            );
        }
    }

    #[test]
    fn test_dropout_no_parameters() {
        let d = Dropout::<f32>::new(0.3).unwrap();
        assert!(d.parameters().is_empty());
        assert!(d.named_parameters().is_empty());
    }

    #[test]
    fn test_dropout_train_eval_toggle() {
        let mut d = Dropout::<f32>::new(0.5).unwrap();
        assert!(d.is_training());
        d.eval();
        assert!(!d.is_training());
        d.train();
        assert!(d.is_training());
    }

    #[test]
    fn test_dropout_is_send_sync() {
        fn assert_send_sync<T: Send + Sync>() {}
        assert_send_sync::<Dropout<f32>>();
        assert_send_sync::<Dropout<f64>>();
    }

    // -----------------------------------------------------------------------
    // Dropout2d
    // -----------------------------------------------------------------------

    #[test]
    fn test_dropout2d_drops_whole_channels() {
        let d = Dropout2d::<f32>::new(0.5).unwrap();
        // Shape: [2, 10, 4, 4] — 2 batches, 10 channels, 4x4 spatial.
        let input = ferrotorch_core::ones::<f32>(&[2, 10, 4, 4]).unwrap();
        let output = d.forward(&input).unwrap();
        let data = output.data().unwrap();

        let spatial = 4 * 4;
        // Check that each channel is either entirely zero or entirely scaled.
        for b in 0..2 {
            for c in 0..10 {
                let start = (b * 10 + c) * spatial;
                let end = start + spatial;
                let channel = &data[start..end];

                let first = channel[0];
                assert!(
                    channel.iter().all(|&x| (x - first).abs() < 1e-6),
                    "channel (b={b}, c={c}) is not uniform: first={first}, channel={channel:?}"
                );
                // Value should be 0 or 1/(1-0.5) = 2.0.
                assert!(
                    first == 0.0 || (first - 2.0).abs() < 1e-6,
                    "channel value = {first}, expected 0.0 or 2.0"
                );
            }
        }
    }

    #[test]
    fn test_dropout2d_rate_approximately_correct() {
        let d = Dropout2d::<f32>::new(0.5).unwrap();
        // Many channels to get a good statistical sample.
        let input = ferrotorch_core::ones::<f32>(&[1, 1000, 2, 2]).unwrap();
        let output = d.forward(&input).unwrap();
        let data = output.data().unwrap();

        let spatial = 2 * 2;
        let mut dropped = 0;
        for c in 0..1000 {
            let start = c * spatial;
            if data[start] == 0.0 {
                dropped += 1;
            }
        }
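        let rate = dropped as f64 / 1000.0;
        // With 1000 channels the empirical rate has a standard deviation of
        // about 0.016, so allow roughly 5 sigma to keep this unseeded check
        // from failing spuriously.
        assert!(
            (rate - 0.5).abs() < 0.08,
            "dropout2d rate = {rate}, expected ~0.5"
        );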
    }

    #[test]
    fn test_dropout2d_eval_is_identity() {
        let mut d = Dropout2d::<f32>::new(0.5).unwrap();
        d.eval();
        let input = ferrotorch_core::ones::<f32>(&[2, 3, 4, 4]).unwrap();
        let output = d.forward(&input).unwrap();
        assert!(output.is_same(&input));
    }

    #[test]
    fn test_dropout2d_invalid_p() {
        assert!(Dropout2d::<f32>::new(1.0).is_err());
        assert!(Dropout2d::<f32>::new(-0.1).is_err());
    }

    #[test]
    fn test_dropout2d_requires_2d_input() {
        let d = Dropout2d::<f32>::new(0.3).unwrap();
        let input_1d = ferrotorch_core::ones::<f32>(&[10]).unwrap();
        assert!(d.forward(&input_1d).is_err());
    }

    #[test]
    fn test_dropout2d_backward_routes_through_surviving_channels() {
        let d = Dropout2d::<f32>::new(0.5).unwrap();
        // [1, 20, 3, 3]
        let input = leaf_tensor(&[1.0; 20 * 3 * 3], &[1, 20, 3, 3], true);
        let output = d.forward(&input).unwrap();

        let out_data = output.data().unwrap().to_vec();
        let total: f32 = out_data.iter().sum();

        // Drive backward through the module-level `SumBackward` helper.
        let loss = Tensor::from_operation(
            TensorStorage::cpu(vec![total]),
            vec![],
            Arc::new(SumBackward {
                input: output.clone(),
            }),
        )
        .unwrap();
        loss.backward().unwrap();

        let grad = input.grad().unwrap().unwrap();
        let grad_data = grad.data().unwrap();
        let out_data = output.data().unwrap();

        // Gradient mask must match output mask.
        for (i, (&o, &g)) in out_data.iter().zip(grad_data.iter()).enumerate() {
            assert_eq!(
                o == 0.0,
                g == 0.0,
                "mismatch at index {i}: output={o}, grad={g}"
            );
        }

        // Gradients should be channel-uniform.
        let spatial = 3 * 3;
        for c in 0..20 {
            let start = c * spatial;
            let end = start + spatial;
            let channel_grad = &grad_data[start..end];
            let first = channel_grad[0];
            assert!(
                channel_grad.iter().all(|&g| (g - first).abs() < 1e-6),
                "gradient channel {c} is not uniform"
            );
        }
    }

    #[test]
    fn test_dropout2d_no_parameters() {
        let d = Dropout2d::<f32>::new(0.3).unwrap();
        assert!(d.parameters().is_empty());
        assert!(d.named_parameters().is_empty());
    }

    #[test]
    fn test_dropout2d_is_send_sync() {
        fn assert_send_sync<T: Send + Sync>() {}
        assert_send_sync::<Dropout2d<f32>>();
        assert_send_sync::<Dropout2d<f64>>();
    }

    // -----------------------------------------------------------------------
    // Dropout1d — CL-433
    // -----------------------------------------------------------------------

    #[test]
    fn test_dropout1d_drops_whole_channels() {
        let d = Dropout1d::<f32>::new(0.5).unwrap();
        // Shape: [2, 10, 8] — 2 batches, 10 channels, length 8.
        let input = ferrotorch_core::ones::<f32>(&[2, 10, 8]).unwrap();
        let output = d.forward(&input).unwrap();
        let data = output.data().unwrap();

        let length = 8;
        for b in 0..2 {
            for c in 0..10 {
                let start = (b * 10 + c) * length;
                let end = start + length;
                let channel = &data[start..end];

                let first = channel[0];
                assert!(
                    channel.iter().all(|&x| (x - first).abs() < 1e-6),
                    "channel (b={b}, c={c}) is not uniform"
                );
                assert!(
                    first == 0.0 || (first - 2.0).abs() < 1e-6,
                    "channel value = {first}, expected 0.0 or 2.0"
                );
            }
        }
    }

    #[test]
    fn test_dropout1d_rate_approximately_correct() {
        let d = Dropout1d::<f32>::new(0.5).unwrap();
        let input = ferrotorch_core::ones::<f32>(&[1, 1000, 4]).unwrap();
        let output = d.forward(&input).unwrap();
        let data = output.data().unwrap();

        let length = 4;
        let mut dropped = 0;
        for c in 0..1000 {
            if data[c * length] == 0.0 {
                dropped += 1;
            }
        }
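        let rate = dropped as f64 / 1000.0;
        // With 1000 channels the empirical rate has a standard deviation of
        // about 0.016, so allow roughly 5 sigma to keep this unseeded check
        // from failing spuriously.
        assert!(
            (rate - 0.5).abs() < 0.08,
            "dropout1d rate = {rate}, expected ~0.5"
        );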
    }

    #[test]
    fn test_dropout1d_eval_is_identity() {
        let mut d = Dropout1d::<f32>::new(0.5).unwrap();
        d.eval();
        let input = ferrotorch_core::ones::<f32>(&[2, 3, 8]).unwrap();
        let output = d.forward(&input).unwrap();
        assert!(output.is_same(&input));
    }

    #[test]
    fn test_dropout1d_invalid_p() {
        assert!(Dropout1d::<f32>::new(1.0).is_err());
        assert!(Dropout1d::<f32>::new(-0.1).is_err());
    }

    #[test]
    fn test_dropout1d_requires_3d_input() {
        let d = Dropout1d::<f32>::new(0.3).unwrap();
        let input_2d = ferrotorch_core::ones::<f32>(&[10, 5]).unwrap();
        assert!(d.forward(&input_2d).is_err());
    }

    #[test]
    fn test_dropout1d_no_parameters() {
        let d = Dropout1d::<f32>::new(0.3).unwrap();
        assert!(d.parameters().is_empty());
    }

    #[test]
    fn test_dropout1d_is_send_sync() {
        fn assert_send_sync<T: Send + Sync>() {}
        assert_send_sync::<Dropout1d<f32>>();
        assert_send_sync::<Dropout1d<f64>>();
    }

    // -----------------------------------------------------------------------
    // Dropout3d — CL-433
    // -----------------------------------------------------------------------

    #[test]
    fn test_dropout3d_drops_whole_channels() {
        let d = Dropout3d::<f32>::new(0.5).unwrap();
        // Shape: [2, 10, 2, 2, 2] — 2 batches, 10 channels, 2x2x2 spatial.
        let input = ferrotorch_core::ones::<f32>(&[2, 10, 2, 2, 2]).unwrap();
        let output = d.forward(&input).unwrap();
        let data = output.data().unwrap();

        let spatial = 2 * 2 * 2;
        for b in 0..2 {
            for c in 0..10 {
                let start = (b * 10 + c) * spatial;
                let end = start + spatial;
                let channel = &data[start..end];

                let first = channel[0];
                assert!(
                    channel.iter().all(|&x| (x - first).abs() < 1e-6),
                    "channel (b={b}, c={c}) is not uniform"
                );
                assert!(
                    first == 0.0 || (first - 2.0).abs() < 1e-6,
                    "channel value = {first}, expected 0.0 or 2.0"
                );
            }
        }
    }

    #[test]
    fn test_dropout3d_rate_approximately_correct() {
        let d = Dropout3d::<f32>::new(0.5).unwrap();
        let input = ferrotorch_core::ones::<f32>(&[1, 1000, 2, 2, 2]).unwrap();
        let output = d.forward(&input).unwrap();
        let data = output.data().unwrap();

        let spatial = 2 * 2 * 2;
        let mut dropped = 0;
        for c in 0..1000 {
            if data[c * spatial] == 0.0 {
                dropped += 1;
            }
        }
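        let rate = dropped as f64 / 1000.0;
        // With 1000 channels the empirical rate has a standard deviation of
        // about 0.016, so allow roughly 5 sigma to keep this unseeded check
        // from failing spuriously.
        assert!(
            (rate - 0.5).abs() < 0.08,
            "dropout3d rate = {rate}, expected ~0.5"
        );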
    }

    #[test]
    fn test_dropout3d_eval_is_identity() {
        let mut d = Dropout3d::<f32>::new(0.5).unwrap();
        d.eval();
        let input = ferrotorch_core::ones::<f32>(&[2, 3, 2, 2, 2]).unwrap();
        let output = d.forward(&input).unwrap();
        assert!(output.is_same(&input));
    }

    #[test]
    fn test_dropout3d_invalid_p() {
        assert!(Dropout3d::<f32>::new(1.0).is_err());
        assert!(Dropout3d::<f32>::new(-0.1).is_err());
    }

    #[test]
    fn test_dropout3d_requires_5d_input() {
        let d = Dropout3d::<f32>::new(0.3).unwrap();
        let input_4d = ferrotorch_core::ones::<f32>(&[2, 3, 4, 4]).unwrap();
        assert!(d.forward(&input_4d).is_err());
    }

    #[test]
    fn test_dropout3d_no_parameters() {
        let d = Dropout3d::<f32>::new(0.3).unwrap();
        assert!(d.parameters().is_empty());
    }

    #[test]
    fn test_dropout3d_is_send_sync() {
        fn assert_send_sync<T: Send + Sync>() {}
        assert_send_sync::<Dropout3d<f32>>();
        assert_send_sync::<Dropout3d<f64>>();
    }

    // -----------------------------------------------------------------------
    // AlphaDropout — CL-433
    // -----------------------------------------------------------------------

    #[test]
    fn test_alpha_dropout_preserves_mean_approx() {
        // With a large sample, the mean should be approximately preserved.
        let d = AlphaDropout::<f64>::new(0.5).unwrap();
        // Generate a ramp over [-0.5, 0.5) whose mean is close to zero.
        let n = 100_000;
        let data: Vec<f64> = (0..n).map(|i| (i as f64 / n as f64) - 0.5).collect();
        let input_mean: f64 = data.iter().sum::<f64>() / n as f64;

        let input = Tensor::from_storage(TensorStorage::cpu(data), vec![1, n], false).unwrap();
        let output = d.forward(&input).unwrap();
        let out_data = output.data().unwrap();
        let out_mean: f64 = out_data.iter().sum::<f64>() / n as f64;

        // Mean should be roughly preserved (within statistical tolerance).
        assert!(
            (out_mean - input_mean).abs() < 0.05,
            "AlphaDropout mean = {out_mean}, input mean = {input_mean}"
        );
    }

    #[test]
    fn test_alpha_dropout_eval_is_identity() {
        let mut d = AlphaDropout::<f32>::new(0.5).unwrap();
        d.eval();
        let input = ferrotorch_core::ones::<f32>(&[100]).unwrap();
        let output = d.forward(&input).unwrap();
        assert!(output.is_same(&input));
    }

    #[test]
    fn test_alpha_dropout_zero_prob_is_identity() {
        let d = AlphaDropout::<f32>::new(0.0).unwrap();
        let input = ferrotorch_core::ones::<f32>(&[100]).unwrap();
        let output = d.forward(&input).unwrap();
        assert!(output.is_same(&input));
    }

    #[test]
    fn test_alpha_dropout_invalid_p() {
        assert!(AlphaDropout::<f32>::new(1.0).is_err());
        assert!(AlphaDropout::<f32>::new(-0.1).is_err());
        assert!(AlphaDropout::<f32>::new(1.5).is_err());
    }

    #[test]
    fn test_alpha_dropout_no_parameters() {
        let d = AlphaDropout::<f32>::new(0.3).unwrap();
        assert!(d.parameters().is_empty());
    }

    #[test]
    fn test_alpha_dropout_backward_routes_gradient() {
        let d = AlphaDropout::<f32>::new(0.5).unwrap();
        let input = leaf_tensor(&[1.0; 1000], &[1000], true);
        let output = d.forward(&input).unwrap();

        let out_data = output.data().unwrap().to_vec();
        let total: f32 = out_data.iter().sum();

        // Drive backward through the module-level `SumBackward` helper.
        let loss = Tensor::from_operation(
            TensorStorage::cpu(vec![total]),
            vec![],
            Arc::new(SumBackward {
                input: output.clone(),
            }),
        )
        .unwrap();
        loss.backward().unwrap();

        let grad = input.grad().unwrap().unwrap();
        let grad_data = grad.data().unwrap();

        // The gradient takes two values: 0 for dropped elements and the
        // affine scale `a` for kept ones.
        let mut seen_zero = false;
        let mut seen_nonzero = false;
        for &g in grad_data {
            if g == 0.0 {
                seen_zero = true;
            } else {
                seen_nonzero = true;
            }
        }
        assert!(
            seen_zero,
            "some elements should have zero gradient (dropped)"
        );
        assert!(
            seen_nonzero,
            "some elements should have nonzero gradient (kept)"
        );
    }

    #[test]
    fn test_alpha_dropout_train_eval_toggle() {
        let mut d = AlphaDropout::<f32>::new(0.5).unwrap();
        assert!(d.is_training());
        d.eval();
        assert!(!d.is_training());
        d.train();
        assert!(d.is_training());
    }

    #[test]
    fn test_alpha_dropout_is_send_sync() {
        fn assert_send_sync<T: Send + Sync>() {}
        assert_send_sync::<AlphaDropout<f32>>();
        assert_send_sync::<AlphaDropout<f64>>();
    }

    // -----------------------------------------------------------------------
    // inplace=true — blocker #1446
    //
    // Mirrors torch's `_VF.dropout_` / `_VF.feature_dropout_` family
    // (`torch/nn/functional.py:1449,1516,1579,1629`): with `inplace=True` and
    // training, the input tensor's storage is mutated (mask + scale written
    // back) instead of a fresh buffer being allocated. The mask-based backward
    // keeps autograd correct.
    // -----------------------------------------------------------------------

    /// A minimal sum-reduction backward node used to drive `.backward()` in
    /// this module's gradient tests.
    #[derive(Debug)]
    struct SumBackward<T: Float> {
        input: Tensor<T>,
    }
    impl<T: Float> GradFn<T> for SumBackward<T> {
        fn backward(&self, _grad_output: &Tensor<T>) -> FerrotorchResult<Vec<Option<Tensor<T>>>> {
            let ones = vec![<T as num_traits::One>::one(); self.input.numel()];
            let t =
                Tensor::from_storage(TensorStorage::cpu(ones), self.input.shape().to_vec(), false)?;
            Ok(vec![Some(t)])
        }
        fn inputs(&self) -> Vec<&Tensor<T>> {
            vec![&self.input]
        }
        fn name(&self) -> &'static str {
            "SumBackward"
        }
    }

    // (a) inplace=true mutates the SAME input storage. The input buffer (all
    //     ones before forward) is overwritten with the masked / scaled values
    //     {0, 2.0}. Verified by reading `input.data()` AFTER forward.
    #[test]
    fn test_dropout_inplace_mutates_input_storage() {
        let d = Dropout::<f32>::new(0.5).unwrap().with_inplace(true);
        assert!(d.inplace());

        // Leaf without grad so we can re-read the input storage directly.
        let buf = vec![1.0f32; 10_000];
        let input = leaf_tensor(&buf, &[10_000], false);
        // Before forward: every element is 1.0.
        assert!(input.data().unwrap().iter().all(|&x| x == 1.0));

        let output = d.forward(&input).unwrap();

        // After forward: the INPUT storage itself has been mutated to the
        // post-dropout values (0.0 dropped, 2.0 = 1/(1-0.5) survivors). This
        // is the load-bearing in-place observation.
        let in_after = input.data().unwrap();
        assert!(
            in_after.contains(&0.0),
            "inplace forward must have zeroed some input elements"
        );
        for &x in in_after {
            assert!(
                x == 0.0 || (x - 2.0).abs() < 1e-6,
                "mutated input element = {x}, expected 0.0 or 2.0"
            );
        }

        // (b) The output equals the mutated input element-for-element: the
        //     in-place write and the returned buffer carry the identical mask.
        let out_data = output.data().unwrap();
        assert_eq!(out_data.len(), in_after.len());
        for (i, (&o, &x)) in out_data.iter().zip(in_after.iter()).enumerate() {
            assert_eq!(o, x, "output/input mismatch at {i}: out={o}, in={x}");
        }
    }

    // (d) eval-mode inplace is identity — torch's `F.dropout(.., training=False,
    //     inplace=True)` returns the input untouched (the `_VF.dropout_` branch
    //     is never reached because training is False; see functional.py:1448).
    #[test]
    fn test_dropout_inplace_eval_is_identity() {
        let mut d = Dropout::<f32>::new(0.5).unwrap().with_inplace(true);
        d.eval();
        let input = leaf_tensor(&[1.0; 100], &[100], false);
        let output = d.forward(&input).unwrap();
        // Identity: same tensor object returned, input storage untouched.
        assert!(output.is_same(&input));
        assert!(input.data().unwrap().iter().all(|&x| x == 1.0));
    }

    // p == 0 with inplace=true is also identity.
    #[test]
    fn test_dropout_inplace_p_zero_is_identity() {
        let d = Dropout::<f32>::new(0.0).unwrap().with_inplace(true);
        let input = leaf_tensor(&[1.0; 100], &[100], false);
        let output = d.forward(&input).unwrap();
        assert!(output.is_same(&input));
        assert!(input.data().unwrap().iter().all(|&x| x == 1.0));
    }

    // (c) backward through an in-place dropout on a grad-tracked NON-LEAF is
    //     correct: the autograd-safe policy falls back to out-of-place (no
    //     version counter to prove the shared storage is unused), so the input
    //     storage is NOT mutated, but the gradient still routes only through
    //     surviving elements (0 for dropped, 2.0 for kept) and the grad mask
    //     matches the output mask, exactly as the out-of-place path.
    #[test]
    fn test_dropout_inplace_backward_routes_through_surviving() {
        use ferrotorch_core::grad_fns::arithmetic::mul;

        let d = Dropout::<f32>::new(0.5).unwrap().with_inplace(true);
        // Non-leaf grad-tracked input: `t = x * 1` requires grad but is not a
        // leaf, so `apply_inplace_dropout` takes the out-of-place fallback
        // rather than erroring on the leaf guard.
        let x = leaf_tensor(&[1.0; 1000], &[1000], true);
        let ones = leaf_tensor(&[1.0; 1000], &[1000], false);
        let input = mul(&x, &ones).unwrap();
        assert!(input.requires_grad() && !input.is_leaf());
        let input_before = input.data().unwrap().to_vec();

        let output = d.forward(&input).unwrap();

        // Safe fallback: the grad-tracked non-leaf storage is left UNMUTATED.
        let input_after = input.data().unwrap().to_vec();
        assert_eq!(
            input_before, input_after,
            "in-place dropout on a grad-tracked non-leaf must fall back to \
             out-of-place and leave the input storage untouched (no version \
             counter to prove the shared storage is unused)"
        );

        let out_data = output.data().unwrap().to_vec();
        let total: f32 = out_data.iter().sum();
        let loss = Tensor::from_operation(
            TensorStorage::cpu(vec![total]),
            vec![],
            Arc::new(SumBackward {
                input: output.clone(),
            }),
        )
        .unwrap();
        loss.backward().unwrap();

        // Gradient flows back to the leaf `x` through the out-of-place dropout.
        let grad = x.grad().unwrap().unwrap();
        let grad_data = grad.data().unwrap();
        for &g in grad_data {
            assert!(
                g == 0.0 || (g - 2.0).abs() < 1e-6,
                "gradient element = {g}, expected 0.0 or 2.0"
            );
        }
        // grad mask matches output mask: dropped iff zero gradient.
        for (i, (&o, &g)) in out_data.iter().zip(grad_data.iter()).enumerate() {
            assert_eq!(
                o == 0.0,
                g == 0.0,
                "mismatch at index {i}: out={o}, grad={g}"
            );
        }
    }

    // (c2) in-place dropout on a grad-requiring LEAF errors, matching torch's
    //      leaf in-place guard (`torch/csrc/autograd/VariableTypeUtils.h:80-84`,
    //      "a leaf Variable that requires grad is being used in an in-place
    //      operation."). Pins #1581.
    #[test]
    fn test_dropout_inplace_on_grad_leaf_errors() {
        let original = vec![1.0f32; 100];
        let d = Dropout::<f32>::new(0.5).unwrap().with_inplace(true);
        let input = leaf_tensor(&original, &[100], true);
        assert!(input.is_leaf() && input.requires_grad());

        let err = d.forward(&input).unwrap_err();
        match err {
            FerrotorchError::InvalidArgument { message } => assert!(
                message.contains("leaf Variable that requires grad"),
                "expected torch leaf-guard message, got: {message}"
            ),
            other => panic!("expected InvalidArgument leaf-guard error, got {other:?}"),
        }
        // The leaf storage is left untouched (no partial mutation before error).
        assert_eq!(input.data().unwrap().to_vec(), original);
    }

    // (e) all four standard dropout variants honor inplace: the input storage
    //     is mutated channel-wise (or element-wise for `Dropout`).
    #[test]
    fn test_dropout2d_inplace_mutates_input_storage() {
        let d = Dropout2d::<f32>::new(0.5).unwrap().with_inplace(true);
        assert!(d.inplace());
        let input = leaf_tensor(&[1.0; 2 * 500 * 4], &[2, 500, 2, 2], false);
        let _ = d.forward(&input).unwrap();
        let in_after = input.data().unwrap();
        // Channel-wise: each (b, c) block of 4 spatial elems is uniform.
        let spatial = 4;
        let mut saw_dropped = false;
        for blk in in_after.chunks(spatial) {
            let first = blk[0];
            assert!(blk.iter().all(|&x| (x - first).abs() < 1e-6));
            assert!(first == 0.0 || (first - 2.0).abs() < 1e-6);
            if first == 0.0 {
                saw_dropped = true;
            }
        }
        assert!(
            saw_dropped,
            "inplace dropout2d must have zeroed some channels"
        );
    }

    #[test]
    fn test_dropout1d_inplace_mutates_input_storage() {
        let d = Dropout1d::<f32>::new(0.5).unwrap().with_inplace(true);
        assert!(d.inplace());
        let input = leaf_tensor(&[1.0; 500 * 4], &[1, 500, 4], false);
        let _ = d.forward(&input).unwrap();
        let in_after = input.data().unwrap();
        let mut saw_dropped = false;
        for blk in in_after.chunks(4) {
            let first = blk[0];
            assert!(blk.iter().all(|&x| (x - first).abs() < 1e-6));
            assert!(first == 0.0 || (first - 2.0).abs() < 1e-6);
            if first == 0.0 {
                saw_dropped = true;
            }
        }
        assert!(
            saw_dropped,
            "inplace dropout1d must have zeroed some channels"
        );
    }

    #[test]
    fn test_dropout3d_inplace_mutates_input_storage() {
        let d = Dropout3d::<f32>::new(0.5).unwrap().with_inplace(true);
        assert!(d.inplace());
        let input = leaf_tensor(&[1.0; 500 * 8], &[1, 500, 2, 2, 2], false);
        let _ = d.forward(&input).unwrap();
        let in_after = input.data().unwrap();
        let mut saw_dropped = false;
        for blk in in_after.chunks(8) {
            let first = blk[0];
            assert!(blk.iter().all(|&x| (x - first).abs() < 1e-6));
            assert!(first == 0.0 || (first - 2.0).abs() < 1e-6);
            if first == 0.0 {
                saw_dropped = true;
            }
        }
        assert!(
            saw_dropped,
            "inplace dropout3d must have zeroed some channels"
        );
    }

    // The non-inplace path is the default and leaves the input untouched —
    // confirms inplace=false (existing behavior) is preserved.
    #[test]
    fn test_dropout_default_is_not_inplace() {
        let d = Dropout::<f32>::new(0.5).unwrap();
        assert!(!d.inplace());
        let input = leaf_tensor(&[1.0; 1000], &[1000], false);
        let _ = d.forward(&input).unwrap();
        // Input untouched: still all ones.
        assert!(input.data().unwrap().iter().all(|&x| x == 1.0));
    }

    // AlphaDropout / FeatureAlphaDropout carry the `inplace` field for API
    // parity but do NOT mutate the input even when inplace=true. This matches
    // torch's module forward (`dropout.py:265-269`, `319-323`), which never
    // passes `self.inplace` to the functional. The field is observable via the
    // `inplace()` getter.
    #[test]
    fn test_alpha_dropout_inplace_field_does_not_mutate() {
        let d = AlphaDropout::<f32>::new(0.5).unwrap().with_inplace(true);
        assert!(d.inplace(), "field is retained for API parity");
        let input = leaf_tensor(&[1.0; 1000], &[1000], false);
        let _ = d.forward(&input).unwrap();
        // Matching torch: the module forward ignores inplace, input untouched.
        assert!(
            input.data().unwrap().iter().all(|&x| x == 1.0),
            "AlphaDropout module forward must not mutate in place (matches torch dropout.py:265-269)"
        );
    }

    #[test]
    fn test_feature_alpha_dropout_inplace_field_does_not_mutate() {
        let d = FeatureAlphaDropout::<f32>::new(0.5)
            .unwrap()
            .with_inplace(true);
        assert!(d.inplace(), "field is retained for API parity");
        let input = leaf_tensor(&[1.0; 1000], &[1, 1000], false);
        let _ = d.forward(&input).unwrap();
        assert!(
            input.data().unwrap().iter().all(|&x| x == 1.0),
            "FeatureAlphaDropout module forward must not mutate in place (matches torch dropout.py:319-323)"
        );
    }

    // -----------------------------------------------------------------------
    // Seed-reproducible byte-match vs LIVE torch 2.11 (#1635 / #1636).
    //
    // Reference values produced by live torch under `torch.manual_seed(42)`
    // — NOT copied from the ferrotorch side (R-CHAR-3). The per-channel /
    // per-element masks come from the byte-exact MT19937 `Generator`, so a
    // shared `ferrotorch_core::manual_seed(42)` reproduces torch's stream.
    // -----------------------------------------------------------------------

    fn ones_shape_t(shape: &[usize]) -> Tensor<f32> {
        let n: usize = shape.iter().product();
        Tensor::from_storage(TensorStorage::cpu(vec![1.0f32; n]), shape.to_vec(), false).unwrap()
    }

    /// `torch.manual_seed(42); F.dropout2d(ones(1,8,1,1),0.5,True)` per-channel
    /// -> survivors scaled by 1/(1-0.5)=2 in the MT19937 keep pattern
    /// [keep,keep,keep,keep,DROP,keep,DROP,DROP].
    #[test]
    fn test_dropout2d_seed42_matches_torch() {
        let want = [2.0, 2.0, 2.0, 2.0, 0.0, 2.0, 0.0, 0.0];
        ferrotorch_core::rng::manual_seed(42);
        let d = Dropout2d::<f32>::new(0.5).unwrap();
        let y = d.forward(&ones_shape_t(&[1, 8, 1, 1])).unwrap();
        assert_eq!(y.data().unwrap(), &want);
    }

    /// `torch.manual_seed(42); F.dropout1d(ones(1,6,3),0.5,True)` per-channel
    /// -> [2,2,2,2,0,2], broadcast over the length-3 dim.
    #[test]
    fn test_dropout1d_seed42_matches_torch() {
        let want = [2.0, 2.0, 2.0, 2.0, 0.0, 2.0];
        ferrotorch_core::rng::manual_seed(42);
        let d = Dropout1d::<f32>::new(0.5).unwrap();
        let y = d.forward(&ones_shape_t(&[1, 6, 3])).unwrap();
        let data = y.data().unwrap();
        // Each channel holds 3 contiguous elements; sample the first of each.
        let per_chan: Vec<f32> = (0..6).map(|c| data[c * 3]).collect();
        assert_eq!(per_chan.as_slice(), &want);
    }

    /// `torch.manual_seed(42); F.dropout3d(ones(1,6,1,1,1),0.5,True)` per-channel
    /// -> [2,2,2,2,0,2].
    #[test]
    fn test_dropout3d_seed42_matches_torch() {
        let want = [2.0, 2.0, 2.0, 2.0, 0.0, 2.0];
        ferrotorch_core::rng::manual_seed(42);
        let d = Dropout3d::<f32>::new(0.5).unwrap();
        let y = d.forward(&ones_shape_t(&[1, 6, 1, 1, 1])).unwrap();
        assert_eq!(y.data().unwrap(), &want);
    }

    /// Two seeded `Dropout2d` forwards under the SAME `manual_seed(42)` produce
    /// the SAME mask (MT19937 reset on manual_seed; no system-time entropy).
    #[test]
    fn test_dropout2d_reproducible_under_manual_seed() {
        let d = Dropout2d::<f32>::new(0.5).unwrap();
        ferrotorch_core::rng::manual_seed(42);
        let y1 = d.forward(&ones_shape_t(&[1, 64, 1, 1])).unwrap();
        ferrotorch_core::rng::manual_seed(42);
        let y2 = d.forward(&ones_shape_t(&[1, 64, 1, 1])).unwrap();
        assert_eq!(y1.data().unwrap(), y2.data().unwrap());
    }

    /// `torch.manual_seed(42); nn.AlphaDropout(0.5).train()(ones(10))`
    /// -> kept = 1.6655989, dropped = -0.7791939 in the MT19937 keep pattern.
    /// The kept/dropped values follow from torch's exact affine
    /// (`Dropout.cpp:74-79`) with alpha = 1.7580993408473766.
    #[test]
    fn test_alpha_dropout_seed42_matches_torch() {
        let want = [
            1.6655989, 1.6655989, 1.6655989, 1.6655989, -0.7791939, 1.6655989, -0.7791939,
            -0.7791939, 1.6655989, 1.6655989,
        ];
        ferrotorch_core::rng::manual_seed(42);
        let d = AlphaDropout::<f32>::new(0.5).unwrap();
        let y = d.forward(&ones_shape_t(&[10])).unwrap();
        let got = y.data().unwrap();
        for (i, (&g, &w)) in got.iter().zip(want.iter()).enumerate() {
            assert!((g - w).abs() < 1e-4, "elem {i}: got {g} want {w}");
        }
    }
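
    // Sketch (not taken from the ferrotorch implementation): where the
    // 1.6655989 / -0.7791939 reference values above come from. It assumes
    // torch's alpha-dropout affine has the form
    //   a   = 1 / sqrt((alpha^2 * p + 1) * (1 - p))
    //   out = x * (keep * a) + (keep - 1) * alpha * a + alpha * a * p
    // with keep in {0, 1}. Only the constant `alpha` and the `Dropout.cpp`
    // location are confirmed by this file; the formula is an assumption, and
    // `reference_alpha_affine` is a hypothetical helper used for illustration.
    fn reference_alpha_affine(x: f64, p: f64) -> (f64, f64) {
        const ALPHA: f64 = 1.7580993408473766;
        // Scale that restores unit variance after the affine shift.
        let a = 1.0 / ((ALPHA * ALPHA * p + 1.0) * (1.0 - p)).sqrt();
        // keep = 1: the input is scaled by `a` and shifted by alpha * a * p.
        let kept = x * a + ALPHA * a * p;
        // keep = 0: the input is discarded and replaced by the saturation value.
        let dropped = -ALPHA * a + ALPHA * a * p;
        (kept, dropped)
    }

    // Pure-arithmetic check: no RNG involved, so it is independent of the seed.
    #[test]
    fn test_alpha_affine_reference_values_p_half() {
        let (kept, dropped) = reference_alpha_affine(1.0, 0.5);
        assert!((kept - 1.6655989).abs() < 1e-5, "kept = {kept}");
        assert!((dropped + 0.7791939).abs() < 1e-5, "dropped = {dropped}");
    }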

    /// `torch.manual_seed(42); nn.FeatureAlphaDropout(0.5).train()(ones(1,6,1,1))`
    /// per-channel -> [1.6655989 ×4, -0.7791939, 1.6655989].
    #[test]
    fn test_feature_alpha_dropout_seed42_matches_torch() {
        let want = [
            1.6655989, 1.6655989, 1.6655989, 1.6655989, -0.7791939, 1.6655989,
        ];
        ferrotorch_core::rng::manual_seed(42);
        let d = FeatureAlphaDropout::<f32>::new(0.5).unwrap();
        let y = d.forward(&ones_shape_t(&[1, 6, 1, 1])).unwrap();
        let got = y.data().unwrap();
        for (i, (&g, &w)) in got.iter().zip(want.iter()).enumerate() {
            assert!((g - w).abs() < 1e-4, "elem {i}: got {g} want {w}");
        }
    }
}