ferrompi 0.4.1 - Docs.rs

//! # ferrompi
//!
//! Safe, generic Rust bindings for MPI (Message Passing Interface).
//!
//! This crate wraps MPI functionality through a thin C layer, providing:
//! - Type-safe generic API for all MPI datatypes
//! - Blocking, nonblocking, and persistent (MPI 4.0+) collectives
//! - Communicator management (split, duplicate)
//! - RMA shared memory windows (with `rma` feature)
//! - SLURM environment helpers (with `numa` feature)
//! - Large count support (MPI 4.0+ `_c` variants for blocking/nonblocking
//!   collectives; persistent collectives currently reject `count > INT_MAX`
//!   with `MPI_ERR_COUNT` — full `_c` dispatch for persistent ops is deferred)
//!
//! ## Supported Types
//!
//! All communication operations are generic over [`MpiDatatype`]:
//! `f32`, `f64`, `i32`, `i64`, `u8`, `u32`, `u64`
//!
//! ## Quick Start
//!
//! ```no_run
//! use ferrompi::{Mpi, ReduceOp};
//!
//! fn main() -> Result<(), ferrompi::Error> {
//!     let mpi = Mpi::init()?;
//!     let world = mpi.world();
//!
//!     let rank = world.rank();
//!     let size = world.size();
//!     println!("Hello from rank {} of {}", rank, size);
//!
//!     // Generic broadcast — works with any MpiDatatype
//!     let mut data = vec![0.0f64; 100];
//!     if rank == 0 {
//!         data.fill(42.0);
//!     }
//!     world.broadcast(&mut data, 0)?;
//!
//!     // Generic all-reduce
//!     let sum = world.allreduce_scalar(rank as f64, ReduceOp::Sum)?;
//!     println!("Rank {rank}: sum of all ranks = {sum}");
//!
//!     Ok(())
//! }
//! ```
//!
//! ## Feature Flags
//!
//! | Feature | Description | Dependencies |
//! |---------|-------------|--------------|
//! | `rma`   | RMA shared memory window operations | — |
//! | `numa`  | NUMA-aware windows and SLURM helpers | `rma` |
//!
//! ## Capabilities
//!
//! - **Generic API**: All operations work with any [`MpiDatatype`] (`f32`, `f64`, `i32`, `i64`, `u8`, `u32`, `u64`)
//! - **Blocking collectives**: barrier, broadcast, reduce, allreduce, gather, scatter, allgather,
//!   alltoall, scan, exscan, reduce\_scatter\_block, plus V-variants (gatherv, scatterv, allgatherv, alltoallv)
//! - **Nonblocking collectives**: All 15 `i`-prefixed variants with [`Request`] handles
//! - **Persistent collectives** (MPI 4.0+): All 15 `_init` variants with [`PersistentRequest`] handles
//! - **Scalar and in-place variants**: `reduce_scalar`, `allreduce_scalar`, `reduce_inplace`,
//!   `allreduce_inplace`, `scan_scalar`, `exscan_scalar`
//! - **Point-to-point**: `send`, `recv`, `isend`, `irecv`, `sendrecv`, `probe`, `iprobe`
//! - **Communicator management**: `split`, `split_type`, `split_shared`, `duplicate`
//! - **Group operations**: [`Group`] with incl/excl/union/intersection/difference,
//!   [`RankRange`] for range constructors, [`GroupComparison`].
//!
//!   Note: [`Mpi::create_from_group`] requires MPI 4.0+.
//!   Support is probed once and cached; see the function rustdoc for the cache invariant.
//! - **Custom datatypes**: [`CustomDatatype`]
//!   (contiguous/vector/struct/resized) and [`StructField`]
//!   for struct-type builders.
//! - **User-defined reduction operations**: [`UserOp`] wraps `MPI_Op_create`
//!   with safe closure storage and trampoline.
//! - **Distributed RMA windows** (feature `rma`): [`Win<T>`](crate::Win) with
//!   [`WinFenceAssert`], [`WinPscwAssert`],
//!   [`WinLockGuard`], and [`WinLockAllGuard`]
//!   RAII guards.
//! - **Info objects**: [`Info`] for runtime hint passing to communicator,
//!   window, and operation constructors.
//! - **Persistent point-to-point**: `send_init`, `bsend_init`, `rsend_init`, `ssend_init`,
//!   `recv_init` methods on [`Communicator`], each returning a
//!   [`PersistentRequest`].
//! - **Shared memory windows** (feature `rma`): [`SharedWindow<T>`](crate::SharedWindow)
//!   with RAII lock guards for NUMA-aware intra-node shared memory (distinct from the
//!   distributed [`Win<T>`](crate::Win) windows above).
//! - **SLURM helpers** (feature `numa`): Job topology queries via `slurm` module
//! - **Rich error handling**: [`MpiErrorClass`] categorization with messages from the MPI runtime
//!
//! ## Thread Safety
//!
//! [`Communicator`] is `Send + Sync` to support hybrid MPI + threads programs
//! (e.g., MPI between nodes, `std::thread::scope` within a node).
//!
//! The actual thread-safety guarantees depend on the thread level requested
//! at initialization:
//!
//! | Thread Level | Who can call MPI | Synchronization |
//! |--------------|------------------|-----------------|
//! | [`ThreadLevel::Single`] | Main thread only | N/A |
//! | [`ThreadLevel::Funneled`] | Main thread only | N/A |
//! | [`ThreadLevel::Serialized`] | Any thread | User must serialize |
//! | [`ThreadLevel::Multiple`] | Any thread | None needed |
//!
//! ```no_run
//! use ferrompi::{Mpi, ThreadLevel};
//!
//! // Request serialized thread support for hybrid MPI + threads
//! let mpi = Mpi::init_thread(ThreadLevel::Funneled).unwrap();
//! assert!(mpi.thread_level() >= ThreadLevel::Funneled);
//! ```
//!
//! [`Mpi`] itself is `!Send + !Sync` — MPI initialization and finalization
//! must occur on the same thread. Only [`Communicator`] handles (and the
//! operations on them) may cross thread boundaries.
//!
//! ### Send/Sync Status of Public Types
//!
//! | Type | Send/Sync | Notes |
//! |------|-----------|-------|
//! | [`Communicator`] | `Send + Sync` | Explicit `unsafe impl` in `src/comm/mod.rs`; cross-thread use is the primary hybrid MPI use case. |
//! | [`Mpi`] | `!Send + !Sync` | `PhantomData<*const ()>` field; init and finalize must occur on the same thread. |
//! | [`Group`] | `Send + Sync` | Explicit `unsafe impl` in `src/group.rs`; handles are opaque integers, MPI-thread-safe under `MPI_THREAD_MULTIPLE`. |
//! | [`Request`] | `Send + Sync` | Auto-derived; `i64` + `bool` fields. Cross-thread use requires `MPI_THREAD_MULTIPLE`. Buffer-lifetime invariant still applies. |
//! | [`PersistentRequest`] | `Send + Sync` | Auto-derived; same shape as `Request`. ADR-0004 §"Drop behavior" applies across thread boundaries. |
//! | [`Status`] | `Send + Sync` | POD wrapper; all fields are `Copy`. |
//! | [`CustomDatatype`] | `Send + Sync` | Explicit `unsafe impl` in `src/datatype_builder.rs`; handle is an opaque integer. |
//! | [`Info`] | `Send + Sync` | Auto-derived from `i32` + `bool` fields; MPI info objects are thread-safe under `MPI_THREAD_MULTIPLE`. |
//! | [`UserOp<T>`] | `Send + Sync` (for `T: MpiDatatype`) | Auto-derived: fields are `i32` + `PhantomData<T>`. The trait bound `MpiDatatype: Copy + Send + 'static` and the fact that all concrete `MpiDatatype` impls are also `Sync` give `Send + Sync` for `UserOp<T>`. The global closure registry uses internal `unsafe impl Send/Sync` on its slots; that is a separate object from `UserOp<T>` itself. |
//! | [`Win<T>`](crate::Win) (feature `rma`) | `!Send + !Sync` | `NonNull<T>` field suppresses auto-traits; RMA window's local memory pointer is not safe to share across threads. |
//! | [`SharedWindow<T>`](crate::SharedWindow) (feature `rma`) | `!Send + !Sync` | `NonNull<T>` field; same rationale as `Win<T>`. |
//! | [`LockGuard<'a, T>`] (feature `rma`) | `!Send + !Sync` | Borrows `SharedWindow<T>`; inherits non-Send/Sync. |
//! | [`LockAllGuard<'a, T>`] (feature `rma`) | `!Send + !Sync` | Borrows `SharedWindow<T>`; inherits non-Send/Sync. |
//! | [`WinLockGuard<'g, 'a, T>`](crate::WinLockGuard) (feature `rma`) | `!Send + !Sync` | Borrows `Win<T>`; inherits non-Send/Sync. |
//! | [`WinLockAllGuard<'g, 'a, T>`](crate::WinLockAllGuard) (feature `rma`) | `!Send + !Sync` | Borrows `Win<T>`; inherits non-Send/Sync. |
//!
//! ## Hybrid MPI+OpenMP
//!
//! For hybrid parallelism, use [`Mpi::init_thread()`] with the appropriate level:
//!
//! - **[`Funneled`](ThreadLevel::Funneled)** (recommended): Only the main thread makes MPI calls.
//!   OpenMP threads handle computation between MPI calls.
//! - **[`Serialized`](ThreadLevel::Serialized)**: Any thread can make MPI calls, but only one at a time.
//! - **[`Multiple`](ThreadLevel::Multiple)**: Full concurrent MPI from any thread (highest overhead).
//!
//! ```no_run
//! use ferrompi::{Mpi, ThreadLevel, ReduceOp};
//!
//! let mpi = Mpi::init_thread(ThreadLevel::Funneled).unwrap();
//! assert!(mpi.thread_level() >= ThreadLevel::Funneled);
//!
//! let world = mpi.world();
//! // Worker threads compute locally, main thread calls MPI
//! let local = 42.0_f64;
//! let global = world.allreduce_scalar(local, ReduceOp::Sum).unwrap();
//! ```
//!
//! ### SLURM Configuration
//!
//! ```bash
//! #SBATCH --ntasks-per-node=4        # MPI ranks per node
//! #SBATCH --cpus-per-task=8          # OpenMP threads per rank
//! #SBATCH --bind-to core             # Pin MPI ranks
//! export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
//! srun ./my_program
//! ```
//!
//! Use the `slurm` module (with `numa` feature) to read these values at runtime.
//! See `examples/hybrid_openmp.rs` for the full pattern.
//!
//! ## Extended documentation
//!
//! Long-form documentation artifacts are embedded in the [`doc`] module and
//! render as individual pages in this rustdoc. The same content is available
//! as plain Markdown in the `docs/` directory.
//!
//! | Module | Description |
//! |--------|-------------|
//! | [`doc::architecture`] | Six-layer stack, handle tables, thread-safety model, FFI/ABI invariants, and generic `MpiDatatype` design |
//! | [`doc::migrating_from_rsmpi`] | Function-for-function API mapping and migration cookbook from rsmpi |
//! | [`doc::mpi_compatibility`] | Compatibility matrix for MPICH, Open MPI, Intel MPI, and Cray MPI |
//! | [`doc::adr_0001_why_c_wrapper`] | ADR-0001: why a hand-written C wrapper is used instead of `bindgen` |
//! | [`doc::adr_0002_handle_tables`] | ADR-0002: C11 atomic CAS strategy for the request-table under `MPI_THREAD_MULTIPLE` |
//! | [`doc::adr_0003_generic_mpi_datatype`] | ADR-0003: sealed `MpiDatatype` trait family and `DatatypeTag` ABI contract |
//! | [`doc::adr_0004_persistent_collective_approach`] | ADR-0004: `PersistentRequest` lifecycle and buffer-lifetime invariants |
//! | [`doc::adr_0005_mpi_op_create`] | ADR-0005: `MPI_Op_create` closure storage, trampoline safety, and drop ordering |

#![warn(missing_docs)]
#![warn(clippy::all)]
// Clippy suppressions live at the call site (`#[allow(clippy::NAME)]`
// with a justification comment) rather than crate-wide.

use std::ffi::{c_char, CString};

mod comm;
mod datatype;
mod datatype_builder;
pub mod doc;
mod error;
mod ffi;
mod group;
mod info;
mod op;
mod persistent;
mod request;
#[cfg(feature = "numa")]
pub mod slurm;
mod status;
mod topology;
#[cfg(feature = "rma")]
mod window;

pub use comm::{Communicator, SplitType};
#[cfg(feature = "rma")]
pub use datatype::AtomicMpiDatatype;
pub use datatype::{
    BytePermutable, DatatypeTag, DoubleInt, FloatInt, Int2, LongDoubleInt, LongInt, MpiDatatype,
    MpiIndexedDatatype, ShortInt,
};
pub use datatype_builder::{CustomDatatype, StructField};
pub use error::{Error, MpiErrorClass, Result};
pub use group::{Group, GroupComparison, RankRange};
pub use info::Info;
pub use op::UserOp;
pub use persistent::PersistentRequest;
pub use request::Request;
pub use status::Status;
#[cfg(feature = "numa")]
pub use topology::SlurmInfo;
pub use topology::{HostEntry, TopologyInfo};
#[cfg(feature = "rma")]
pub use window::{
    LockAllGuard, LockGuard, LockType, PendingFetchResult, SharedWindow, Win, WinFenceAssert,
    WinKind, WinLockAllGuard, WinLockGuard, WinPscwAssert,
};

use std::marker::PhantomData;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Mutex, OnceLock};

/// Global flag tracking whether MPI has been initialized
static MPI_INITIALIZED: AtomicBool = AtomicBool::new(false);

/// Process-wide buffer attached for buffered sends (`MPI_Buffer_attach`).
///
/// MPI allows at most one attached buffer per process. This static holds the
/// `Box<[u8]>` so that the allocation remains valid for the duration of the
/// attachment. The `Mutex` is held only briefly during `buffer_attach` and
/// `buffer_detach` transitions; MPI itself manages the buffer between those
/// two calls.
static ATTACHED_BUFFER: Mutex<Option<Box<[u8]>>> = Mutex::new(None);

/// MPI thread support levels
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
#[repr(i32)]
pub enum ThreadLevel {
    /// Only single-threaded execution
    Single = 0,
    /// Multi-threaded, but MPI calls only from main thread
    Funneled = 1,
    /// Multi-threaded, but MPI calls serialized by user
    Serialized = 2,
    /// Full multi-threaded support
    Multiple = 3,
}

/// Reduction operations
///
/// The `Replace` and `NoOp` variants are only available with the `rma` feature.
///
/// # Feature-gated variants
///
/// Without `--features rma`, referencing `ReduceOp::Replace` is a compile error:
///
#[cfg_attr(not(feature = "rma"), doc = "```compile_fail")]
#[cfg_attr(
    not(feature = "rma"),
    doc = "// This must not compile without --features rma."
)]
#[cfg_attr(not(feature = "rma"), doc = "let _ = ferrompi::ReduceOp::Replace;")]
#[cfg_attr(not(feature = "rma"), doc = "```")]
#[cfg_attr(feature = "rma", doc = "```no_run")]
#[cfg_attr(
    feature = "rma",
    doc = "// With --features rma, ReduceOp::Replace is available."
)]
#[cfg_attr(feature = "rma", doc = "let _ = ferrompi::ReduceOp::Replace;")]
#[cfg_attr(feature = "rma", doc = "```")]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(i32)]
pub enum ReduceOp {
    /// Sum of values
    Sum = 0,
    /// Maximum value
    Max = 1,
    /// Minimum value
    Min = 2,
    /// Product of values
    Prod = 3,
    /// Bitwise OR (`MPI_BOR`). Valid only for integer types; MPI returns
    /// `MPI_ERR_OP` when used with floating-point types.
    BitwiseOr = 4,
    /// Bitwise AND (`MPI_BAND`). Valid only for integer types; MPI returns
    /// `MPI_ERR_OP` when used with floating-point types.
    BitwiseAnd = 5,
    /// Bitwise XOR (`MPI_BXOR`). Valid only for integer types; MPI returns
    /// `MPI_ERR_OP` when used with floating-point types.
    BitwiseXor = 6,
    /// Logical OR (`MPI_LOR`). Interprets nonzero as `true`. Valid for
    /// integer types.
    LogicalOr = 7,
    /// Logical AND (`MPI_LAND`). Interprets nonzero as `true`. Valid for
    /// integer types.
    LogicalAnd = 8,
    /// Logical XOR (`MPI_LXOR`). Interprets nonzero as `true`. Valid for
    /// integer types.
    LogicalXor = 9,
    /// Maximum value with location (`MPI_MAXLOC`). Returns the maximum value
    /// and the rank (index) where it occurred. Only valid with
    /// [`MpiIndexedDatatype`] via
    /// [`Communicator::allreduce_indexed`](crate::Communicator::allreduce_indexed).
    MaxLoc = 10,
    /// Minimum value with location (`MPI_MINLOC`). Returns the minimum value
    /// and the rank (index) where it occurred. Only valid with
    /// [`MpiIndexedDatatype`] via
    /// [`Communicator::allreduce_indexed`](crate::Communicator::allreduce_indexed).
    MinLoc = 11,
    /// Replace the target buffer with the source value (`MPI_REPLACE`).
    ///
    /// Only valid for `MPI_Accumulate`-family operations (Epic 7). Passing to
    /// `allreduce`, `reduce`, `scan`, etc. returns `MPI_ERR_OP` from MPI.
    ///
    /// This variant is only present when the `rma` feature is enabled.
    #[cfg(feature = "rma")]
    Replace = 12,
    /// No-op: leaves the target buffer unchanged (`MPI_NO_OP`).
    ///
    /// Only valid for `MPI_Accumulate`-family operations (Epic 7). Passing to
    /// `allreduce`, `reduce`, `scan`, etc. returns `MPI_ERR_OP` from MPI.
    ///
    /// # Compile-time availability
    ///
    /// This variant is only present when the `rma` feature is enabled.
    #[cfg(feature = "rma")]
    NoOp = 13,
}

/// MPI environment handle.
///
/// This type represents an initialized MPI environment. There can only be one
/// instance of this type at a time. When dropped, it finalizes MPI.
///
/// # Example
///
/// ```no_run
/// use ferrompi::Mpi;
///
/// let mpi = Mpi::init().expect("Failed to initialize MPI");
/// let world = mpi.world();
/// println!("Running on {} processes", world.size());
/// // MPI is finalized when `mpi` goes out of scope
/// ```
pub struct Mpi {
    /// The thread level that was provided
    thread_level: ThreadLevel,
    /// Marker to make Mpi !Send and !Sync
    _marker: PhantomData<*const ()>,
}

impl Mpi {
    /// Initialize MPI with single-threaded support.
    ///
    /// # Errors
    ///
    /// Returns an error if MPI is already initialized or if initialization fails.
    pub fn init() -> Result<Self> {
        Self::init_thread(ThreadLevel::Single)
    }

    /// Initialize MPI with the specified thread support level.
    ///
    /// # Arguments
    ///
    /// * `required` - The minimum thread support level required
    ///
    /// # Returns
    ///
    /// Returns the MPI handle. The actual thread support level provided can be
    /// queried with [`thread_level()`](Self::thread_level).
    ///
    /// # Errors
    ///
    /// Returns an error if MPI is already initialized or if initialization fails.
    pub fn init_thread(required: ThreadLevel) -> Result<Self> {
        // Check if already initialized
        if MPI_INITIALIZED.swap(true, Ordering::SeqCst) {
            return Err(Error::AlreadyInitialized);
        }

        let mut provided: i32 = 0;
        let ret = unsafe { ffi::ferrompi_init_thread(required as i32, &mut provided) };

        if ret != 0 {
            MPI_INITIALIZED.store(false, Ordering::SeqCst);
            return Err(Error::Mpi {
                class: MpiErrorClass::Raw(ret),
                code: ret,
                message: format!("MPI_Init_thread failed with code {ret}"),
                operation: Some("init_thread"),
            });
        }

        let thread_level = match provided {
            0 => ThreadLevel::Single,
            1 => ThreadLevel::Funneled,
            2 => ThreadLevel::Serialized,
            _ => ThreadLevel::Multiple,
        };

        Ok(Mpi {
            thread_level,
            _marker: PhantomData,
        })
    }

    /// Get the thread support level that was provided.
    pub fn thread_level(&self) -> ThreadLevel {
        self.thread_level
    }

    /// Get a handle to `MPI_COMM_WORLD`.
    pub fn world(&self) -> Communicator {
        Communicator::world()
    }

    /// Get the current wall-clock time.
    ///
    /// This is a high-resolution timer suitable for benchmarking.
    pub fn wtime() -> f64 {
        unsafe { ffi::ferrompi_wtime() }
    }

    /// Get the MPI library version string (implementation-specific).
    ///
    /// Returns a string such as `"Open MPI v4.1.6"` or `"Intel(R) MPI Library 2021.7"`.
    /// This wraps `MPI_Get_library_version`.
    pub fn library_version() -> Result<String> {
        // MPI_MAX_LIBRARY_VERSION_STRING is 8192 in most implementations.
        let mut buf = [0u8; 8192];
        let mut len: i32 = 0;
        let ret = unsafe {
            ffi::ferrompi_get_library_version(buf.as_mut_ptr().cast::<c_char>(), &mut len)
        };
        Error::check_with_op(ret, "get_library_version")?;
        let len = (len.max(0) as usize).min(buf.len());
        // Trim trailing whitespace/newlines that some implementations append.
        let s = std::str::from_utf8(&buf[..len])
            .map_err(|_| Error::Internal("Invalid UTF-8 in library version string".into()))?;
        Ok(s.trim_end().to_string())
    }

    /// Get the MPI standard version string (e.g., "MPI 4.0").
    pub fn version() -> Result<String> {
        let mut buf = [0u8; 256];
        let mut len: i32 = 0;
        let ret = unsafe { ffi::ferrompi_get_version(buf.as_mut_ptr().cast::<c_char>(), &mut len) };

        if ret != 0 {
            return Err(Error::from_code_with_op(ret, "get_version"));
        }

        let len = (len.max(0) as usize).min(buf.len());
        let s = std::str::from_utf8(&buf[..len])
            .map_err(|_| Error::Internal("Invalid UTF-8 in version string".into()))?;
        Ok(s.to_string())
    }

    /// Check if MPI has been initialized.
    pub fn is_initialized() -> bool {
        let mut flag: i32 = 0;
        unsafe { ffi::ferrompi_initialized(&mut flag) };
        flag != 0
    }

    /// Check if MPI has been finalized.
    pub fn is_finalized() -> bool {
        let mut flag: i32 = 0;
        unsafe { ffi::ferrompi_finalized(&mut flag) };
        flag != 0
    }

    /// Returns `true` if the runtime MPI version is 4.0 or later.
    ///
    /// Caching semantics:
    /// - An `Err` from `Mpi::version()` (e.g., called before `Mpi::init`)
    ///   is NOT cached; a subsequent call after init can re-probe and
    ///   observe support correctly.
    /// - A successful `version()` whose string parses to a major version
    ///   `≥ 4` caches `true`.
    /// - A successful `version()` whose string parses to a major version
    ///   `< 4` (including unrecognized formats that yield major `= 0` via
    ///   the `unwrap_or(0)` fallback) caches `false`. Re-probing is not
    ///   possible once a successful `version()` has been seen, so a
    ///   non-standard version string format from an unusual MPI build
    ///   permanently disables `Mpi::create_from_group` for this process.
    fn supports_create_from_group() -> bool {
        static SUPPORTED: OnceLock<bool> = OnceLock::new();
        if let Some(&cached) = SUPPORTED.get() {
            return cached;
        }
        // Probe. If `version()` fails, return `false` without caching;
        // a future call can re-probe successfully.
        let Ok(v) = Mpi::version() else {
            return false;
        };
        // Mpi::version() returns a string like "MPI 4.0" or "MPI 3.1".
        // Extract the major version number from the second whitespace-delimited
        // token, then its first dot-delimited component.
        let major: u32 = v
            .split_whitespace()
            .nth(1)
            .and_then(|tok| tok.split('.').next())
            .and_then(|s| s.parse().ok())
            .unwrap_or(0);
        let supported = major >= 4;
        // Race: if another thread already won set(), discard our value;
        // both threads agree on the result anyway because the probe is
        // deterministic given a successful version() call.
        let _ = SUPPORTED.set(supported);
        supported
    }

    /// Create a communicator from a group without requiring a parent
    /// communicator (MPI 4.0+).
    ///
    /// `stringtag` must be identical across all ranks that participate
    /// in the call; ranks with different tags or in different groups
    /// produce separate communicators.
    ///
    /// # Errors
    ///
    /// - Returns `Err(Error::Internal(_))` if `stringtag` contains a null byte
    ///   (the FFI call is never invoked in this case).
    /// - Returns `Err(Error::NotSupported("MPI_Comm_create_from_group"))` on
    ///   MPI < 4.0 installations.
    /// - Returns `Err(Error::Mpi { .. })` if the underlying MPI call fails.
    ///
    /// # Example
    ///
    /// ```no_run
    /// use ferrompi::Mpi;
    ///
    /// let mpi = Mpi::init().unwrap();
    /// let world = mpi.world();
    /// let g = world.group().unwrap();
    /// // All ranks participate with the same tag.
    /// let comm = mpi.create_from_group(&g, "my-tag").unwrap();
    /// assert_eq!(comm.size(), world.size());
    /// ```
    pub fn create_from_group(&self, group: &group::Group, stringtag: &str) -> Result<Communicator> {
        let c_tag = CString::new(stringtag)
            .map_err(|_| Error::Internal("stringtag contains null byte".into()))?;
        if !Self::supports_create_from_group() {
            return Err(Error::NotSupported(
                "MPI_Comm_create_from_group".to_string(),
            ));
        }
        let mut new_handle: i32 = -1;
        // SAFETY: c_tag.as_ptr() is a valid, null-terminated C string that
        // lives for the duration of this call. group.handle is a valid group
        // handle obtained from ferrompi_comm_group or a group-constructor shim.
        // &mut new_handle is a pointer to a stack-allocated i32.
        let ret = unsafe {
            ffi::ferrompi_comm_create_from_group(group.handle, c_tag.as_ptr(), &mut new_handle)
        };
        Error::check_with_op(ret, "comm_create_from_group")?;
        Communicator::from_handle(new_handle)
    }

    /// Attach a user-provided buffer for use by buffered sends.
    ///
    /// This wraps `MPI_Buffer_attach`. Once attached, the buffer is owned by
    /// MPI until [`buffer_detach`](Self::buffer_detach) is called. Only one
    /// buffer may be attached per process at a time; attempting to attach a
    /// second buffer without first detaching returns
    /// `Err(`[`Error::InvalidOp`]`)`.
    ///
    /// The `buffer` is stored in a process-wide static so its allocation
    /// remains valid for the lifetime of the attachment. You must not access
    /// the raw bytes of `buffer` between `buffer_attach` and `buffer_detach` —
    /// MPI owns the contents during that window.
    ///
    /// # Buffer Sizing
    ///
    /// The recommended buffer size for `N` buffered sends of `count` elements
    /// of type `T` is:
    ///
    /// ```text
    /// N * (MPI_BSEND_OVERHEAD + count * size_of::<T>())
    /// ```
    ///
    /// `MPI_BSEND_OVERHEAD` is implementation-specific; use at least a few
    /// hundred extra bytes per buffered send. For safety, use a generous margin.
    ///
    /// Buffers larger than `i32::MAX` bytes are rejected with
    /// `Err(`[`Error::InvalidBuffer`]`)` before the FFI call; the
    /// underlying `MPI_Buffer_attach` takes an `int` count and cannot
    /// address larger buffers.
    ///
    /// # Errors
    ///
    /// - [`Error::InvalidBuffer`] if `buffer.len() > i32::MAX as usize`.
    /// - [`Error::InvalidOp`] if a buffer is already attached.
    /// - [`Error::Mpi`] if `MPI_Buffer_attach` fails.
    ///
    /// # Example
    ///
    /// ```no_run
    /// # use ferrompi::Mpi;
    /// let mpi = Mpi::init().unwrap();
    /// mpi.buffer_attach(vec![0u8; 64 * 1024].into_boxed_slice()).unwrap();
    /// // ... buffered sends here ...
    /// let _ = mpi.buffer_detach().unwrap();
    /// ```
    pub fn buffer_attach(&self, buffer: Box<[u8]>) -> Result<()> {
        let mut guard = ATTACHED_BUFFER
            .lock()
            .map_err(|_| Error::Internal("ATTACHED_BUFFER mutex poisoned".into()))?;
        if guard.is_some() {
            return Err(Error::InvalidOp);
        }
        if buffer.len() > i32::MAX as usize {
            return Err(Error::InvalidBuffer);
        }
        let ptr = buffer.as_ptr() as *mut std::ffi::c_void;
        let size = buffer.len() as i64;
        // Store the box in the static BEFORE calling MPI so the memory is
        // guaranteed alive when MPI begins using the buffer.
        *guard = Some(buffer);
        // SAFETY: ptr points to the boxed slice we just stored in the static;
        // the allocation remains valid until buffer_detach drops it. size is
        // the exact byte length of that allocation.
        let ret = unsafe { ffi::ferrompi_buffer_attach(ptr, size) };
        if ret != 0 {
            // Roll back: reclaim the box so the caller can retry.
            guard.take();
            return Err(Error::from_code_with_op(ret, "buffer_attach"));
        }
        Ok(())
    }

    /// Detach the previously attached buffer and return it to the caller.
    ///
    /// This wraps `MPI_Buffer_detach`. The call **blocks** until all buffered
    /// sends that are currently using the buffer have completed. Once this
    /// returns, the returned `Box<[u8]>` is owned by the caller again and
    /// may be dropped or reused.
    ///
    /// Do **not** call this from a `Drop` implementation (e.g., on a wrapper
    /// around `Mpi`). `MPI_Buffer_detach` blocks until pending sends drain;
    /// blocking inside `Drop` can produce hard-to-diagnose hangs.
    ///
    /// # Errors
    ///
    /// - [`Error::InvalidOp`] if no buffer is currently attached.
    /// - [`Error::Mpi`] if `MPI_Buffer_detach` fails.
    ///
    /// # Example
    ///
    /// ```no_run
    /// # use ferrompi::Mpi;
    /// let mpi = Mpi::init().unwrap();
    /// mpi.buffer_attach(vec![0u8; 64 * 1024].into_boxed_slice()).unwrap();
    /// // ... buffered sends ...
    /// let _buf = mpi.buffer_detach().unwrap(); // dropped here
    /// ```
    pub fn buffer_detach(&self) -> Result<Box<[u8]>> {
        let mut guard = ATTACHED_BUFFER
            .lock()
            .map_err(|_| Error::Internal("ATTACHED_BUFFER mutex poisoned".into()))?;
        if guard.is_none() {
            return Err(Error::InvalidOp);
        }
        let mut out_ptr: *mut std::ffi::c_void = std::ptr::null_mut();
        let mut out_size: i64 = 0;
        // SAFETY: out_ptr and out_size are valid stack-allocated output parameters.
        // MPI_Buffer_detach writes the buffer pointer and its size into them.
        let ret = unsafe { ffi::ferrompi_buffer_detach(&mut out_ptr, &mut out_size) };
        if ret != 0 {
            return Err(Error::from_code_with_op(ret, "buffer_detach"));
        }
        // Reclaim the Box from the static; this is the same allocation MPI just
        // released. We take() here so the static is cleared atomically.
        let buf = guard.take().expect("guard was Some; take() must succeed");
        Ok(buf)
    }
}

impl Drop for Mpi {
    fn drop(&mut self) {
        // Only finalize if we successfully initialized
        if MPI_INITIALIZED.load(Ordering::SeqCst) {
            unsafe {
                ffi::ferrompi_finalize();
            }
            MPI_INITIALIZED.store(false, Ordering::SeqCst);
        }
    }
}

// Mpi is not Send or Sync - MPI must be used from the thread that initialized it
// (unless thread level is Multiple)
// This is enforced by PhantomData<*const ()> in the struct

#[cfg(test)]
mod tests {
    // Note: MPI tests must be run with mpiexec
    // cargo build --examples && mpiexec -n 4 ./target/debug/examples/hello_world

    use super::*;

    // ── ThreadLevel tests ──────────────────────────────────────────────

    #[test]
    fn thread_level_ordering() {
        assert!(ThreadLevel::Single < ThreadLevel::Funneled);
        assert!(ThreadLevel::Funneled < ThreadLevel::Serialized);
        assert!(ThreadLevel::Serialized < ThreadLevel::Multiple);
    }

    #[test]
    fn thread_level_equality() {
        assert_eq!(ThreadLevel::Single, ThreadLevel::Single);
        assert_eq!(ThreadLevel::Funneled, ThreadLevel::Funneled);
        assert_eq!(ThreadLevel::Serialized, ThreadLevel::Serialized);
        assert_eq!(ThreadLevel::Multiple, ThreadLevel::Multiple);
        assert_ne!(ThreadLevel::Single, ThreadLevel::Multiple);
        assert_ne!(ThreadLevel::Funneled, ThreadLevel::Serialized);
    }

    #[test]
    fn thread_level_repr_values() {
        assert_eq!(ThreadLevel::Single as i32, 0);
        assert_eq!(ThreadLevel::Funneled as i32, 1);
        assert_eq!(ThreadLevel::Serialized as i32, 2);
        assert_eq!(ThreadLevel::Multiple as i32, 3);
    }

    #[test]
    fn thread_level_debug_clone() {
        let level = ThreadLevel::Funneled;
        let cloned = level;
        assert_eq!(format!("{cloned:?}"), "Funneled");

        assert_eq!(format!("{:?}", ThreadLevel::Single), "Single");
        assert_eq!(format!("{:?}", ThreadLevel::Serialized), "Serialized");
        assert_eq!(format!("{:?}", ThreadLevel::Multiple), "Multiple");
    }

    // ── ReduceOp tests ─────────────────────────────────────────────────

    #[test]
    fn reduce_op_repr_values() {
        let ops = [
            (ReduceOp::Sum, 0),
            (ReduceOp::Max, 1),
            (ReduceOp::Min, 2),
            (ReduceOp::Prod, 3),
            (ReduceOp::BitwiseOr, 4),
            (ReduceOp::BitwiseAnd, 5),
            (ReduceOp::BitwiseXor, 6),
            (ReduceOp::LogicalOr, 7),
            (ReduceOp::LogicalAnd, 8),
            (ReduceOp::LogicalXor, 9),
            (ReduceOp::MaxLoc, 10),
            (ReduceOp::MinLoc, 11),
        ];
        for (op, expected) in ops {
            assert_eq!(op as i32, expected);
        }
        #[cfg(feature = "rma")]
        {
            assert_eq!(ReduceOp::Replace as i32, 12);
            assert_eq!(ReduceOp::NoOp as i32, 13);
        }
    }

    #[test]
    fn reduce_op_equality() {
        assert_eq!(ReduceOp::Sum, ReduceOp::Sum);
        assert_eq!(ReduceOp::Max, ReduceOp::Max);
        assert_eq!(ReduceOp::Min, ReduceOp::Min);
        assert_eq!(ReduceOp::Prod, ReduceOp::Prod);
        assert_eq!(ReduceOp::BitwiseOr, ReduceOp::BitwiseOr);
        assert_eq!(ReduceOp::BitwiseAnd, ReduceOp::BitwiseAnd);
        assert_eq!(ReduceOp::BitwiseXor, ReduceOp::BitwiseXor);
        assert_eq!(ReduceOp::LogicalOr, ReduceOp::LogicalOr);
        assert_eq!(ReduceOp::LogicalAnd, ReduceOp::LogicalAnd);
        assert_eq!(ReduceOp::LogicalXor, ReduceOp::LogicalXor);
        assert_eq!(ReduceOp::MaxLoc, ReduceOp::MaxLoc);
        assert_eq!(ReduceOp::MinLoc, ReduceOp::MinLoc);
        assert_ne!(ReduceOp::Sum, ReduceOp::Max);
        assert_ne!(ReduceOp::Min, ReduceOp::Prod);
        assert_ne!(ReduceOp::Sum, ReduceOp::Prod);
        assert_ne!(ReduceOp::BitwiseOr, ReduceOp::BitwiseAnd);
        assert_ne!(ReduceOp::LogicalOr, ReduceOp::LogicalAnd);
        assert_ne!(ReduceOp::Sum, ReduceOp::BitwiseOr);
        assert_ne!(ReduceOp::MaxLoc, ReduceOp::MinLoc);
        assert_ne!(ReduceOp::MaxLoc, ReduceOp::Max);
    }

    #[test]
    fn reduce_op_debug_clone() {
        let op = ReduceOp::Sum;
        let cloned = op;
        assert_eq!(format!("{cloned:?}"), "Sum");

        assert_eq!(format!("{:?}", ReduceOp::Max), "Max");
        assert_eq!(format!("{:?}", ReduceOp::Min), "Min");
        assert_eq!(format!("{:?}", ReduceOp::Prod), "Prod");
        assert_eq!(format!("{:?}", ReduceOp::BitwiseOr), "BitwiseOr");
        assert_eq!(format!("{:?}", ReduceOp::BitwiseAnd), "BitwiseAnd");
        assert_eq!(format!("{:?}", ReduceOp::BitwiseXor), "BitwiseXor");
        assert_eq!(format!("{:?}", ReduceOp::LogicalOr), "LogicalOr");
        assert_eq!(format!("{:?}", ReduceOp::LogicalAnd), "LogicalAnd");
        assert_eq!(format!("{:?}", ReduceOp::LogicalXor), "LogicalXor");
        assert_eq!(format!("{:?}", ReduceOp::MaxLoc), "MaxLoc");
        assert_eq!(format!("{:?}", ReduceOp::MinLoc), "MinLoc");
    }

    #[test]
    fn reduce_op_all_variants_match_c_switch() {
        let variants = [
            (ReduceOp::Sum, 0i32),
            (ReduceOp::Max, 1),
            (ReduceOp::Min, 2),
            (ReduceOp::Prod, 3),
            (ReduceOp::BitwiseOr, 4),
            (ReduceOp::BitwiseAnd, 5),
            (ReduceOp::BitwiseXor, 6),
            (ReduceOp::LogicalOr, 7),
            (ReduceOp::LogicalAnd, 8),
            (ReduceOp::LogicalXor, 9),
            (ReduceOp::MaxLoc, 10),
            (ReduceOp::MinLoc, 11),
        ];
        for (op, expected) in variants {
            assert_eq!(op as i32, expected);
        }
        #[cfg(feature = "rma")]
        {
            assert_eq!(ReduceOp::Replace as i32, 12);
            assert_eq!(ReduceOp::NoOp as i32, 13);
        }
    }

    #[cfg(feature = "rma")]
    #[test]
    fn replace_noop_discriminants() {
        assert_eq!(ReduceOp::Replace as i32, 12);
        assert_eq!(ReduceOp::NoOp as i32, 13);
    }

    // ── Mpi::create_from_group unit tests ─────────────────────────────────

    // ── Mpi::buffer_attach / buffer_detach unit tests ─────────────────────

    /// Documentation-anchor for the `Error::InvalidBuffer` contract on
    /// the oversize path. The functional guard at `src/lib.rs` (search
    /// `i32::MAX as usize` inside `buffer_attach`) cannot be invoked from
    /// a unit test without `MPI_Init`; behavioral verification requires an
    /// integration example (deferred to epic-06 follow-up). This test
    /// only witnesses that `Error::InvalidBuffer` is a valid variant.
    #[test]
    fn buffer_attach_invalid_buffer_variant_exists() {
        let err = Error::InvalidBuffer;
        assert!(matches!(err, Error::InvalidBuffer));
    }

    /// Compile-time witness that buffer_attach and buffer_detach have the
    /// correct signatures. No MPI runtime is needed — the functions are
    /// never called.
    #[test]
    fn buffer_attach_signature_compiles() {
        fn _check(mpi: &Mpi, buf: Box<[u8]>) -> Result<()> {
            mpi.buffer_attach(buf)
        }
        fn _check_detach(mpi: &Mpi) -> Result<Box<[u8]>> {
            mpi.buffer_detach()
        }
    }

    /// Calling buffer_attach twice (without a detach in between) must return
    /// Err(Error::InvalidOp).  We test against the static ATTACHED_BUFFER
    /// directly by calling the method twice with a stub Mpi.  The first call
    /// will reach MPI (and may fail for various reasons in a non-MPI test
    /// environment), so we only rely on the second call returning InvalidOp.
    /// To avoid touching MPI at all we seed the static manually.
    #[test]
    fn buffer_attach_double_attach_returns_invalid_op() {
        // Seed the static to simulate an already-attached buffer.
        {
            let mut g = ATTACHED_BUFFER.lock().unwrap();
            if g.is_none() {
                *g = Some(vec![0u8; 4].into_boxed_slice());
            }
        }

        let mpi = Mpi {
            thread_level: ThreadLevel::Single,
            _marker: PhantomData,
        };
        let buf2 = vec![0u8; 8].into_boxed_slice();
        let result = mpi.buffer_attach(buf2);
        assert!(
            matches!(result, Err(Error::InvalidOp)),
            "expected Err(InvalidOp) on double attach, got: {result:?}"
        );

        // Clean up: remove the seeded entry so other tests are unaffected.
        ATTACHED_BUFFER.lock().unwrap().take();
    }

    /// Calling buffer_detach with no buffer attached must return
    /// Err(Error::InvalidOp).
    #[test]
    fn buffer_detach_without_attach_returns_invalid_op() {
        // Ensure the static is empty.
        {
            let mut g = ATTACHED_BUFFER.lock().unwrap();
            *g = None;
        }

        let mpi = Mpi {
            thread_level: ThreadLevel::Single,
            _marker: PhantomData,
        };
        let result = mpi.buffer_detach();
        assert!(
            matches!(result, Err(Error::InvalidOp)),
            "expected Err(InvalidOp) on detach without attach, got: {result:?}"
        );
    }

    // ── Mpi::create_from_group unit tests ─────────────────────────────────

    /// Verify that `supports_create_from_group` is callable and has the
    /// expected function type. The probe-failure-not-cached invariant is
    /// enforced structurally: the implementation uses `get()` / `set()`
    /// rather than `get_or_init`, which is verified by the acceptance grep.
    /// Behavioural verification (call before init, then after init) requires
    /// a running MPI environment and is documented as a manual test scenario.
    #[test]
    fn supports_create_from_group_does_not_cache_probe_failures() {
        let _: fn() -> bool = Mpi::supports_create_from_group;
    }

    /// Verify that a `stringtag` containing a null byte is rejected before
    /// the FFI call is ever invoked.  We test the null-byte path directly
    /// by calling the public method on a `Group` stub with handle 0
    /// (MPI_GROUP_EMPTY); the early-return on bad tag means MPI is never
    /// touched, so no running MPI environment is needed.
    #[test]
    fn create_from_group_null_byte_in_tag() {
        // Construct a minimal stub Mpi to call the method (no MPI calls made
        // because the null-byte check fires before the version probe or FFI).
        // We bypass init by constructing the struct directly — this is valid
        // inside the crate's own test module where the fields are accessible.
        let mpi = Mpi {
            thread_level: ThreadLevel::Single,
            _marker: PhantomData,
        };
        // Group with handle 0 (MPI_GROUP_EMPTY sentinel) — never dereferenced
        // because the null-byte check fires first.
        let g = group::Group { handle: 0 };
        let result = mpi.create_from_group(&g, "bad\0tag");
        match result {
            Err(Error::Internal(msg)) => {
                assert!(
                    msg.contains("null byte"),
                    "expected 'null byte' in error message, got: {msg}"
                );
            }
            Ok(_) => panic!("expected Err(Error::Internal(_)), got Ok(_)"),
            Err(e) => panic!("expected Err(Error::Internal(_)), got Err({e})"),
        }
    }
}