ferrolearn-cluster 0.5.0

//! DBSCAN (Density-Based Spatial Clustering of Applications with Noise).
//!
//! This module provides [`DBSCAN`], a density-based clustering algorithm
//! that groups together points that are closely packed together, marking
//! as outliers points that lie alone in low-density regions.
//!
//! # Algorithm
//!
//! 1. For each point, find all neighbors within distance `eps`.
//! 2. Points with at least `min_samples` neighbors (including themselves)
//!    are **core points**.
//! 3. Clusters are formed by connecting core points that are within `eps`
//!    of each other, and assigning border points to the cluster of their
//!    nearest core point.
//! 4. Points that are not reachable from any core point are labeled as
//!    **noise** (label = -1).
//!
//! # Notes
//!
//! DBSCAN does **not** implement [`Predict`](ferrolearn_core::Predict) — it
//! only labels the training data. Use the fitted labels directly.
//!
//! # Examples
//!
//! ```
//! use ferrolearn_cluster::DBSCAN;
//! use ferrolearn_core::Fit;
//! use ndarray::Array2;
//!
//! let x = Array2::from_shape_vec((6, 2), vec![
//!     1.0, 1.0,  1.1, 1.0,  1.0, 1.1,
//!     5.0, 5.0,  5.1, 5.0,  5.0, 5.1,
//! ]).unwrap();
//!
//! let model = DBSCAN::<f64>::new(1.0);
//! let fitted = model.fit(&x, &()).unwrap();
//! let labels = fitted.labels();
//! assert_eq!(labels.len(), 6);
//! ```
//!
//! # `## REQ status`
//!
//! Binary classification (R-DEFER-2): two states only — SHIPPED needs impl + a
//! non-test production consumer + green verification + symbol-anchor + sklearn
//! `file:line`; NOT-STARTED carries the open prereq blocker. **Unlike the
//! `feature_agglomeration.rs` / `spectral.rs` siblings, `DBSCAN` HAS a real PyO3
//! binding** — `#[pyclass(name = "_RsDBSCAN")] struct RsDBSCAN` in
//! `ferrolearn-python/src/extras.rs` (its `fn fit` calls
//! `ferrolearn_cluster::DBSCAN::<f64>::new(self.eps).with_min_samples(self.min_samples)`
//! and `#[getter] labels_` surfaces `f.labels()`), registered via
//! `m.add_class::<extras::RsDBSCAN>()` in `ferrolearn-python/src/lib.rs`, surfaced
//! as `ferrolearn.DBSCAN` (`fit_predict` / `labels_`). So
//! `import ferrolearn; ferrolearn.DBSCAN(...).fit(X).labels_` is the non-test
//! production consumer of the core contract. **Honest assessment (R-HONEST-3):
//! this unit's CORE contract genuinely SHIPS** — DBSCAN is deterministic (no RNG,
//! no iterative optimizer), so on the Euclidean / no-`sample_weight` path
//! `labels_` and `core_sample_indices_` VALUE-match the live sklearn 1.5.2 oracle
//! EXACTLY (element-wise, including the shared-border tie-break and the noise
//! `-1` set), backed by green guards in
//! `ferrolearn-cluster/tests/divergence_dbscan.rs`. The divergence appears ONLY
//! on a neighbor edge where ferrolearn's `sum-of-squares ≤ eps²` and sklearn's
//! `euclidean_distances` (dot-product) / tree distance round to OPPOSITE SIDES of
//! the `eps` boundary — e.g. coords whose true distance is within a ULP of `eps`
//! (the two forms then disagree on inclusion, flipping core status / merging
//! clusters). When the boundary value is exactly representable the two agree
//! (the green guards include Fixture C with edges exactly at `eps=1.0`, where
//! `1.0²` is exact and both include). This exact-boundary distance-form parity is
//! carved out as REQ-11 (NOT-STARTED, #952), tied to the neighbor-search
//! algorithm surface (#949). The NOT-STARTED surface is the
//! unimplemented parameter/attribute surface (`metric`/`p`,
//! `algorithm`/`leaf_size`/`n_jobs`, the `eps=0.5` Rust-constructor default + error
//! ABI, the ferray substrate), each carrying an open
//! prereq blocker. Cites use symbol anchors (ferrolearn) / `file:line`
//! (sklearn 1.5.2, commit 156ef14). Live oracle = installed sklearn 1.5.2. (REQ
//! numbering follows `.design/cluster/dbscan.md`.)
//!
//! | REQ | Status | Evidence |
//! |---|---|---|
//! | REQ-1 (`labels_` VALUE parity, Euclidean / no `sample_weight`) | SHIPPED | impl `Fit::fit` (index-ordered cluster expansion via `VecDeque` BFS) + `fn fit_predict` in `dbscan.rs` mirror `dbscan_inner` (`sklearn/cluster/_dbscan_inner.pyx`) + `_dbscan.py:431-439` (`labels = np.full(n, -1)`, `dbscan_inner(...)`): clusters numbered by the first unlabeled core in index order, noise `= -1`, a border point reachable from two clusters joins the FIRST-reaching cluster (guard `if labels[neighbor] == -1`). VALUE-matches sklearn EXACTLY element-wise on the tested fixtures; diverges only when ferrolearn's `sum-sq ≤ eps²` and sklearn's `euclidean_distances`/tree distance round to opposite sides of the `eps` boundary (true distance within a ULP of `eps`) — carved out as REQ-11 (NOT-STARTED, #952). Non-test consumer: `ferrolearn.DBSCAN.fit(X).labels_` via `RsDBSCAN::labels_` (`ferrolearn-python/src/extras.rs`). Green guards (`tests/divergence_dbscan.rs`): `green_dbscan_labels_two_clusters`, `green_dbscan_labels_three_clusters`, `green_dbscan_shared_border_tie`, `green_dbscan_noise_and_core_indices`. |
//! | REQ-2 (`core_sample_indices_` VALUE parity) | SHIPPED | impl `core_sample_indices = (0..n).filter(\|&i\| is_core[i]).collect()` in `Fit::fit` + accessor `fn core_sample_indices` in `dbscan.rs`, mirroring `core_sample_indices_ = np.where(core_samples)[0]` (`_dbscan.py:438`), core `= neighborhoods[i].len() >= min_samples` (`:435`). VALUE-matches ascending EXACTLY on non-eps-boundary inputs (the eps-boundary case flips a point's core status — REQ-11/#952). Non-test consumer: crate re-export `pub use dbscan::{DBSCAN, FittedDBSCAN}` (`ferrolearn-cluster/src/lib.rs`) — the in-crate accessor (the PyO3 layer surfaces only `labels_`, not `core_sample_indices_`). Green guards: `green_dbscan_shared_border_tie` (Fixture C core `[0,1,2,3,5,6,7,8]`, border idx4 excluded), `green_dbscan_noise_and_core_indices` (Fixture D core `[0,1,2,4,6,7,8,9,13]`). |
//! | REQ-3 (`eps>0` / `min_samples>=1` validation boundary) | SHIPPED | impl `Fit::fit` guards `self.eps <= F::zero()` → `Err(FerroError::InvalidParameter { name: "eps" })` and `self.min_samples == 0` → `Err(FerroError::InvalidParameter { name: "min_samples" })`, matching the accept/reject boundary of `_parameter_constraints` `"eps": [Interval(Real, 0.0, None, closed="neither")]` / `"min_samples": [Interval(Integral, 1, None, closed="left")]` (`_dbscan.py:332-333`). Non-test consumer: `RsDBSCAN::fit` maps the error to `PyValueError` (`ferrolearn-python/src/extras.rs`). Green guards: `green_dbscan_eps_zero_rejected` (`eps=0` → `Err`), `green_dbscan_min_samples_one_self_core` (`min_samples=1` accepted, self-core). GAP: the error TYPE is `FerroError::InvalidParameter`, not sklearn's `InvalidParameterError`/`ValueError` ABI — the boundary itself matches; the ABI nuance is tracked under REQ-4. |
//! | REQ-4 (`eps=0.5` Rust-constructor default + sklearn error ABI) | NOT-STARTED | open prereq blocker **#946**. sklearn `__init__` `eps=0.5` default (`_dbscan.py:347`); ferrolearn `fn new(eps: F)` REQUIRES `eps` — no default. (The PyO3 layer DOES default `eps=0.5` in `RsDBSCAN::new`, but the Rust constructor `DBSCAN::new` does not.) Also validation errors are `FerroError::InvalidParameter`, not the sklearn `InvalidParameterError`/`ValueError` ABI (R-DEV-2). |
//! | REQ-5 (`sample_weight` — alters core determination) | SHIPPED | impl: `DBSCAN<F>` gains `pub sample_weight: Option<Array1<F>>` + builder `fn with_sample_weight` (`dbscan.rs`); `Fit::fit` validates `w.len() == n_samples` (else `FerroError::ShapeMismatch { context: "sample_weight" }`, mirroring `_check_sample_weight`'s shape `ValueError`, `sklearn/utils/validation.py:2055-2060`) then, when `Some(w)`, sets the core mask `is_core[i] = (sum_{j in neighborhoods[i]} w[j]) >= F::from(min_samples)` — a FLOAT compare over the SAME self-including neighborhoods — mirroring `n_neighbors = [sum(sample_weight[neighbors])...]` + `core_samples = n_neighbors >= self.min_samples` (`_dbscan.py:427-435`). `None` keeps the byte-for-byte unchanged `len >= min_samples` path. Downstream BFS / `core_sample_indices_` / `components_` / `labels_` consume the new mask UNCHANGED. Negative/zero weights accepted (DBSCAN does not pass `only_non_negative`). Non-test consumer: crate re-export `pub use dbscan::{DBSCAN, FittedDBSCAN}` (`ferrolearn-cluster/src/lib.rs`) — the in-crate builder/accessors (the PyO3 `sample_weight` Python surface stays a REQ-9 follow-on, matching how `core_sample_indices_`/`components_` already ship through the re-export, not the binding). Live-oracle guards (`tests/divergence_dbscan.rs`, R-CHAR-3): `green_dbscan_weighted_flips_isolated_to_core` (`w=[5,1,1]` → `[0,-1,-1]` core `[0]` vs unweighted `[-1,-1,-1]` core `[]`), `green_dbscan_weighted_demotes_core_to_noise` (`w=[0.5;4]`, `min_samples=4` → all noise vs unweighted one cluster), `green_dbscan_all_ones_equals_unweighted` (`w=[1;8]` reproduces Fixture-A `[0,0,0,0,1,1,1,1]`/core `[0..8]` EXACTLY), `green_dbscan_fractional_weight_boundary` (`w=[0.5;8]`: `8*0.5=4.0 >= 4` core / `< 5` noise — the float `>=` boundary), `green_dbscan_sample_weight_wrong_length_errs` (`Err`, no panic). |
//! | REQ-6 (`metric` / `p`) | SHIPPED | impl: `pub enum DbscanMetric<F> { Euclidean, Manhattan, Minkowski(F), Chebyshev }` + field `pub metric: DbscanMetric<F>` (default `Euclidean`) + builders `fn with_metric` / `fn with_p` on `DBSCAN<F>` (`dbscan.rs`); `Fit::fit` resolves the metric into `enum MetricThreshold` and `fn region_query` branches on it — Euclidean keeps the UNCHANGED `sum_sq <= eps*eps` (byte-for-byte default), Manhattan `sum\|dx\| <= eps`, Chebyshev `max\|dx\| <= eps`, Minkowski(p) the POWERED form `sum\|dx\|^p <= eps.powf(p)` (no outer `^(1/p)` root; `p==1` collapses to Manhattan, `p==2` to the exact Euclidean sum-sq form). Mirrors sklearn's `NearestNeighbors(metric=..., p=...).radius_neighbors` neighbor test `dist <= eps` (`_dbscan.py:411-422`), metric formulas `sklearn/metrics/_dist_metrics.pyx.tp:102-105`, default `'euclidean'` (`_dbscan.py:350`), `p=None`→`p=2` (`_dbscan.py:354`). `p <= 0` / NaN → `Err(FerroError::InvalidParameter { name: "p" })` (no panic), mirroring sklearn's `NearestNeighbors` `p in (0, inf]` `InvalidParameterError`. Non-test consumer: crate re-export `pub use dbscan::{DBSCAN, FittedDBSCAN}` (`ferrolearn-cluster/src/lib.rs`) surfaces the `with_metric`/`with_p` builders + public `metric` field (the enum is reachable via the `pub mod dbscan`; the PyO3 `metric`/`p` Python surface stays a follow-on, like `sample_weight`/REQ-5). Live-oracle guards (`tests/divergence_dbscan.rs`, R-CHAR-3): `green_dbscan_manhattan_vs_euclidean_labels` (Fixture M: manhattan `[-1,0,0]`/core `[1,2]` vs euclidean `[0,0,0]`/`[0,1,2]`), `green_dbscan_default_euclidean_equals_no_metric` (explicit Euclidean == no-metric, default unchanged), `green_dbscan_minkowski_p1_p2_collapse` (p=1==manhattan, p=2==euclidean, both == sklearn), `green_dbscan_chebyshev_and_minkowski_p3` (Fixture N: chebyshev & p=3 `[0,0,0,0]`/`[0,1,2,3]` vs euclidean `[-1,0,0,-1]`/`[1,2]`), `green_dbscan_minkowski_nonpositive_p_rejected` (p=-1/0 → `Err`, no panic). GAP: `metric_params`, `metric='precomputed'`, callable metric stay NOT-STARTED — `metric_params` (per-metric kwargs like Minkowski `w`/seuclidean `V`) has no ferrolearn surface (only scalar `p` is plumbed); `precomputed`/callable need a distance-matrix / closure input path (#948 scope). The #952 exact-eps-boundary distance-form caveat (REQ-11) PERSISTS per metric: each metric's `<=` compares ferrolearn's powered form against the powered `eps`, which can round to the opposite side of sklearn's tree/`pairwise_distances` computation for an edge whose true distance is within a ULP of `eps`. |
//! | REQ-7 (`algorithm` / `leaf_size` / `n_jobs`) | NOT-STARTED | open prereq blocker **#949**. sklearn routes neighbor search through `NearestNeighbors(radius, algorithm, leaf_size, ..., n_jobs)` (`_dbscan.py:411-422`; constraints `:339-342`). ferrolearn uses a fixed brute-force `O(n^2)` `fn region_query` with no parameter. (Brute force value-matches the default `'auto'`; the divergence is the absent parameter surface.) Missing surface. |
//! | REQ-8 (`components_` fitted attribute) | SHIPPED | impl: `Fit::fit` builds `components_: Array2<F>` of shape `(core_sample_indices.len(), n_features)` whose row `k` is `x.row(core_sample_indices[k])`, stored in `FittedDBSCAN`, surfaced via accessor `fn components` in `dbscan.rs` — mirroring `self.components_ = X[self.core_sample_indices_].copy()` (`_dbscan.py:441-446`). VALUE-matches the live sklearn 1.5.2 oracle (`DBSCAN(eps=0.5, min_samples=3)` on the 7-point fixture → `components_` shape `(6,2)` = `[[1,1],[1.2,1.1],[0.9,1.0],[8,8],[8.1,8.2],[8.0,7.9]]`, the rows of `X` at `core_sample_indices_ = [0,1,2,3,4,5]`). Consumer: crate re-export `pub use dbscan::{DBSCAN, FittedDBSCAN}` (`ferrolearn-cluster/src/lib.rs`) — the in-crate accessor (the PyO3 layer does not yet re-expose `components_`, see REQ-9). Green guard: `dbscan_components_match_sklearn` (`dbscan.rs` tests). |
//! | REQ-9 (PyO3 binding VALUE parity) | SHIPPED | impl `#[pyclass(name = "_RsDBSCAN")] RsDBSCAN` (`ferrolearn-python/src/extras.rs`): `fn new(eps=0.5, min_samples=5)`, `fn fit` calling `ferrolearn_cluster::DBSCAN::<f64>::new(self.eps).with_min_samples(self.min_samples)`, `#[getter] labels_`; registered via `m.add_class::<extras::RsDBSCAN>()` (`ferrolearn-python/src/lib.rs`), surfaced as `ferrolearn.DBSCAN` (`fit_predict` / `labels_`). Non-test consumer: `import ferrolearn; ferrolearn.DBSCAN(...).fit(X).labels_`. Since the Rust core value-matches (REQ-1/2), `import ferrolearn` matches `import sklearn` on the Euclidean / no-`sample_weight` path. The binding does NOT yet re-expose `core_sample_indices_` / `components_` / `sample_weight` / `metric` (only `labels_` via `#[getter]`) — those ride their own REQs. |
//! | REQ-10 (ferray substrate) | NOT-STARTED | open prereq blocker **#951**. `dbscan.rs` imports `ndarray::{Array1, Array2}` + `num_traits::Float` + `std::collections::VecDeque`; not migrated to `ferray-core` (R-SUBSTRATE-1/2). The PyO3 boundary uses `numpy2_to_ndarray` (`extras.rs`), an `ndarray` bridge, not `ferray::numpy_interop`. |
//! | REQ-11 (exact eps-boundary neighbor parity) | NOT-STARTED | open prereq blocker **#952**. `fn region_query` (`dbscan.rs`) includes a neighbor when `squared_euclidean ≤ eps*eps`; sklearn routes through `NearestNeighbors(radius=eps).radius_neighbors` (`_dbscan.py:411-422`) whose `euclidean_distances`/tree distance computation rounds differently at the exact boundary. For an edge whose true distance ≈ `eps` (e.g. exactly `1.3` with `eps=1.3`, or `0.9999999999999998` with `eps=1.0`), the two disagree on inclusion → flips core status / merges clusters, diverging from the oracle `labels_`/`core_sample_indices_`. Reproducing sklearn's exact boundary requires its `euclidean_distances` dot-product rounding AND its `algorithm`-dependent neighbor search (#949) — not a single-file fix. Documented (no committed failing test per R-DEFER-6 / the kdtree #831 + balltree #858 convention). |
//! | REQ-12 (reject non-finite input) | SHIPPED | impl `Fit::fit` checks `x.iter().any(\|v\| !v.is_finite())` up front (after the `eps`/`min_samples` param checks, before metric resolution / `region_query`) and returns `Err(FerroError::InvalidParameter{name:"X"})`, rejecting NaN AND infinity. Mirrors sklearn's `_validate_data(X, accept_sparse="csr")` (`force_all_finite=True` default) reached from `DBSCAN.fit` (`_dbscan.py:395`), which raises `ValueError` (`validation.py:147-154`). Consumers: the existing `fit` entry — PyO3 `RsDBSCAN::fit` (`extras.rs`) + crate re-export `pub use dbscan::{DBSCAN, FittedDBSCAN}` (`lib.rs`). Pinned by `divergence_nonfinite_reject.rs` (`divergence_dbscan_fit_rejects_nan/_inf`) — live sklearn 1.5.2 raises, ferrolearn now `Err`. Finite input byte-identical (the module's oracle pins stay green). |

use ferrolearn_core::error::FerroError;
use ferrolearn_core::traits::Fit;
use ndarray::{Array1, Array2};
use num_traits::Float;
use std::collections::VecDeque;

/// NumPy-compatible pairwise summation, byte-for-byte mirroring `np.sum` over a
/// 1-D contiguous array (NumPy `pairwise_sum`,
/// `numpy/_core/src/umath/loops_utils.h.src:81-145`).
///
/// DBSCAN's weighted core determination compares `np.sum(sample_weight[neighbors])`
/// against `min_samples` (`sklearn/cluster/_dbscan.py:428-435`). A sequential
/// left-fold rounds differently from NumPy's pairwise reduction (e.g. ten `0.1`
/// weights sum to exactly `1.0` under pairwise but `0.999…` sequentially),
/// flipping the float `>=` boundary, so the weight sum must use this exact
/// reduction order (#2189).
fn numpy_pairwise_sum<F: Float>(a: &[F]) -> F {
    let n = a.len();
    if n < 8 {
        // Start from -0.0 to preserve `-0` semantics (numpy `:86-89`).
        let mut res = F::neg_zero();
        for &v in a {
            res = res + v;
        }
        res
    } else if n <= 128 {
        // Eight-accumulator unrolled block (numpy `:96-135`).
        let mut r = [a[0], a[1], a[2], a[3], a[4], a[5], a[6], a[7]];
        let limit = n - (n % 8);
        let mut i = 8;
        while i < limit {
            r[0] = r[0] + a[i];
            r[1] = r[1] + a[i + 1];
            r[2] = r[2] + a[i + 2];
            r[3] = r[3] + a[i + 3];
            r[4] = r[4] + a[i + 4];
            r[5] = r[5] + a[i + 5];
            r[6] = r[6] + a[i + 6];
            r[7] = r[7] + a[i + 7];
            i += 8;
        }
        let mut res = ((r[0] + r[1]) + (r[2] + r[3])) + ((r[4] + r[5]) + (r[6] + r[7]));
        // Non-multiple-of-8 remainder (numpy `:131-134`).
        while i < n {
            res = res + a[i];
            i += 1;
        }
        res
    } else {
        // Divide and conquer, n2 rounded down to a multiple of 8 (numpy `:137-143`).
        let mut n2 = n / 2;
        n2 -= n2 % 8;
        numpy_pairwise_sum(&a[..n2]) + numpy_pairwise_sum(&a[n2..])
    }
}

/// The neighbor distance metric used by [`DBSCAN`] to decide whether two
/// samples are within `eps` of each other.
///
/// A point `j` is a neighbor of point `i` iff `dist(i, j) <= eps` under the
/// chosen metric, mirroring scikit-learn's
/// `NearestNeighbors(radius=eps, metric=..., p=...).radius_neighbors`
/// (`sklearn/cluster/_dbscan.py:411-422`). The metric names and formulas
/// mirror `sklearn.metrics.DistanceMetric`
/// (`sklearn/metrics/_dist_metrics.pyx.tp:102-105`):
///
/// - [`Euclidean`](DbscanMetric::Euclidean) (`'euclidean'`/`'l2'`):
///   `sqrt(sum((xi - yi)^2))`.
/// - [`Manhattan`](DbscanMetric::Manhattan) (`'manhattan'`/`'l1'`/`'cityblock'`):
///   `sum(|xi - yi|)`.
/// - [`Minkowski(p)`](DbscanMetric::Minkowski) (`'minkowski'` with `p`):
///   `(sum(|xi - yi|^p))^(1/p)`; `p = 1` is Manhattan, `p = 2` is Euclidean.
///   In scikit-learn, `metric='minkowski'` with `p=None` means `p = 2`
///   (Euclidean) (`sklearn/cluster/_dbscan.py:354`); construct
///   [`Minkowski`](DbscanMetric::Minkowski) with the resolved `p` value.
/// - [`Chebyshev`](DbscanMetric::Chebyshev) (`'chebyshev'`/`'l_inf'`):
///   `max(|xi - yi|)`.
///
/// The default is [`Euclidean`](DbscanMetric::Euclidean), so
/// [`DBSCAN::new`] (and any model that never calls [`DBSCAN::with_metric`])
/// behaves identically to the metric-free implementation.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DbscanMetric<F> {
    /// Euclidean (L2) distance: `sqrt(sum((xi - yi)^2))`.
    Euclidean,
    /// Manhattan (L1 / cityblock) distance: `sum(|xi - yi|)`.
    Manhattan,
    /// Minkowski distance of order `p`: `(sum(|xi - yi|^p))^(1/p)`.
    ///
    /// `p = 1` is equivalent to [`Manhattan`](DbscanMetric::Manhattan) and
    /// `p = 2` to [`Euclidean`](DbscanMetric::Euclidean). The neighbor test
    /// compares the powered sum `sum(|xi - yi|^p)` against `eps^p` to avoid
    /// the outer `^(1/p)` root (matching sklearn's `dist <= eps`). `p` must be
    /// strictly positive; a non-positive `p` is rejected at [`Fit::fit`] time
    /// (sklearn's `NearestNeighbors` requires `p` in `(0, inf]`).
    Minkowski(F),
    /// Chebyshev (L-infinity) distance: `max(|xi - yi|)`.
    Chebyshev,
}

impl<F> Default for DbscanMetric<F> {
    /// The default metric is [`Euclidean`](DbscanMetric::Euclidean), matching
    /// scikit-learn's `metric='euclidean'` default
    /// (`sklearn/cluster/_dbscan.py:350`).
    fn default() -> Self {
        DbscanMetric::Euclidean
    }
}

/// DBSCAN clustering configuration (unfitted).
///
/// Holds hyperparameters for the DBSCAN algorithm. Call [`Fit::fit`]
/// to run the algorithm and produce a [`FittedDBSCAN`].
///
/// # Type Parameters
///
/// - `F`: The floating-point type (`f32` or `f64`).
#[derive(Debug, Clone)]
pub struct DBSCAN<F> {
    /// The maximum distance between two samples for them to be considered
    /// as in the same neighborhood.
    pub eps: F,
    /// The minimum number of samples in a neighborhood for a point to be
    /// considered a core point (including the point itself).
    pub min_samples: usize,
    /// Optional per-sample weights. When `Some(w)`, a point `i` is a core
    /// point iff the SUM of weights over its neighbors (including itself)
    /// is `>= min_samples`, mirroring sklearn's
    /// `n_neighbors[i] = sum(sample_weight[neighbors])`
    /// (`sklearn/cluster/_dbscan.py:427-429`). `None` is equivalent to all
    /// ones (the unweighted `len >= min_samples` path).
    pub sample_weight: Option<Array1<F>>,
    /// The neighbor distance metric. A sample `j` is a neighbor of `i` iff
    /// `dist(i, j) <= eps` under this metric, mirroring
    /// `NearestNeighbors(metric=..., p=...)` (`sklearn/cluster/_dbscan.py:411-422`).
    /// Defaults to [`DbscanMetric::Euclidean`] (sklearn `metric='euclidean'`,
    /// `_dbscan.py:350`); the Euclidean path is byte-for-byte identical to the
    /// metric-free implementation.
    pub metric: DbscanMetric<F>,
}

impl<F: Float> DBSCAN<F> {
    /// Create a new `DBSCAN` with the given `eps` radius.
    ///
    /// Uses default `min_samples = 5` and no sample weights.
    #[must_use]
    pub fn new(eps: F) -> Self {
        Self {
            eps,
            min_samples: 5,
            sample_weight: None,
            metric: DbscanMetric::Euclidean,
        }
    }

    /// Set the minimum number of samples for core points.
    #[must_use]
    pub fn with_min_samples(mut self, min_samples: usize) -> Self {
        self.min_samples = min_samples;
        self
    }

    /// Set the neighbor distance metric.
    ///
    /// Mirrors scikit-learn's `DBSCAN(metric=..., p=...)`: a sample `j` is a
    /// neighbor of `i` iff `dist(i, j) <= eps` under the chosen metric
    /// (`sklearn/cluster/_dbscan.py:411-422`). The default
    /// [`DbscanMetric::Euclidean`] reproduces the metric-free path exactly. For
    /// `metric='minkowski'` with `p`, pass [`DbscanMetric::Minkowski`] with the
    /// resolved `p` (sklearn's `p=None` means `p = 2`, i.e. Euclidean,
    /// `_dbscan.py:354`); the `p` value is validated to be strictly positive at
    /// [`Fit::fit`] time.
    #[must_use]
    pub fn with_metric(mut self, metric: DbscanMetric<F>) -> Self {
        self.metric = metric;
        self
    }

    /// Set the Minkowski order `p`.
    ///
    /// In scikit-learn `p` is a parameter ORTHOGONAL to `metric` that is used
    /// ONLY by the Minkowski metric and is IGNORED for every other metric
    /// (`sklearn/cluster/_dbscan.py:411-418`; e.g. `DBSCAN(metric='euclidean',
    /// p=3)` is plain Euclidean, not Minkowski, #2192). This mirrors that: it
    /// updates the order only when the current metric is
    /// [`DbscanMetric::Minkowski`], and is a no-op otherwise. To SELECT the
    /// Minkowski metric, use [`with_metric`](Self::with_metric) with
    /// [`DbscanMetric::Minkowski`] (`p = 1` is Manhattan, `p = 2` is Euclidean;
    /// `p` is validated strictly positive at [`Fit::fit`] time, matching
    /// `NearestNeighbors`' `p` in `(0, inf]`).
    #[must_use]
    pub fn with_p(mut self, p: F) -> Self {
        if matches!(self.metric, DbscanMetric::Minkowski(_)) {
            self.metric = DbscanMetric::Minkowski(p);
        }
        self
    }

    /// Set per-sample weights.
    ///
    /// Mirrors scikit-learn's `DBSCAN.fit(X, sample_weight=w)`: a sample with
    /// a weight of at least `min_samples` is by itself a core sample, and a
    /// sample with a negative weight may inhibit its eps-neighbors from being
    /// core (`sklearn/cluster/_dbscan.py:384-388`). Weights are absolute and
    /// default to 1 (passing all-ones reproduces the unweighted path exactly).
    /// Negative and zero weights are accepted (sklearn's `_check_sample_weight`
    /// does NOT require non-negativity for DBSCAN). The weight vector length is
    /// validated against the number of samples at [`Fit::fit`] time.
    #[must_use]
    pub fn with_sample_weight(mut self, w: Array1<F>) -> Self {
        self.sample_weight = Some(w);
        self
    }
}

/// Fitted DBSCAN model.
///
/// Stores the cluster labels and core sample indices from the training run.
/// Noise points are labeled with -1.
///
/// DBSCAN does **not** implement [`Predict`](ferrolearn_core::Predict).
#[derive(Debug, Clone)]
pub struct FittedDBSCAN<F> {
    /// Cluster label for each training sample. Noise points have label -1.
    labels_: Array1<isize>,
    /// Indices of core samples in the training data.
    core_sample_indices_: Vec<usize>,
    /// Coordinates of the core samples (rows of `X` at `core_sample_indices_`).
    pub(crate) components_: Array2<F>,
    /// Phantom data to retain the float type parameter.
    _marker: std::marker::PhantomData<F>,
}

impl<F: Float> FittedDBSCAN<F> {
    /// Return the cluster labels for the training data.
    ///
    /// Noise points have label `-1`.
    #[must_use]
    pub fn labels(&self) -> &Array1<isize> {
        &self.labels_
    }

    /// Return the indices of core samples in the training data.
    #[must_use]
    pub fn core_sample_indices(&self) -> &[usize] {
        &self.core_sample_indices_
    }

    /// Return the coordinates of the core samples, of shape
    /// `(n_core_samples, n_features)`.
    ///
    /// Row `k` is the training row `X[core_sample_indices()[k]]`, mirroring
    /// sklearn's `self.components_ = X[self.core_sample_indices_].copy()`
    /// (`sklearn/cluster/_dbscan.py:441-446`).
    #[must_use]
    pub fn components(&self) -> &Array2<F> {
        &self.components_
    }

    /// Return the number of clusters found (excluding noise).
    #[must_use]
    pub fn n_clusters(&self) -> usize {
        let max_label = self.labels_.iter().max().copied().unwrap_or(-1);
        if max_label < 0 {
            0
        } else {
            (max_label + 1) as usize
        }
    }
}

/// Compute the squared Euclidean distance between two slices.
fn squared_euclidean<F: Float>(a: &[F], b: &[F]) -> F {
    a.iter()
        .zip(b.iter())
        .fold(F::zero(), |acc, (&ai, &bi)| acc + (ai - bi) * (ai - bi))
}

/// Compute the Manhattan (L1) distance `sum(|ai - bi|)` between two slices.
fn manhattan<F: Float>(a: &[F], b: &[F]) -> F {
    a.iter()
        .zip(b.iter())
        .fold(F::zero(), |acc, (&ai, &bi)| acc + (ai - bi).abs())
}

/// Compute the powered Minkowski sum `sum(|ai - bi|^p)` between two slices
/// (the inner term of `(sum(|xi - yi|^p))^(1/p)`, BEFORE the outer root).
fn minkowski_powered_sum<F: Float>(a: &[F], b: &[F], p: F) -> F {
    a.iter()
        .zip(b.iter())
        .fold(F::zero(), |acc, (&ai, &bi)| acc + (ai - bi).abs().powf(p))
}

/// Compute the Chebyshev (L-infinity) distance `max(|ai - bi|)` between two
/// slices.
fn chebyshev<F: Float>(a: &[F], b: &[F]) -> F {
    a.iter()
        .zip(b.iter())
        .fold(F::zero(), |acc, (&ai, &bi)| acc.max((ai - bi).abs()))
}

/// The per-metric threshold form used by [`region_query`].
///
/// To match scikit-learn's `dist(i, j) <= eps` while avoiding extra rounding
/// from an outer root, each variant carries a comparison that is exact in the
/// same algebra sklearn uses:
///
/// - [`SquaredEuclidean(eps^2)`](MetricThreshold::SquaredEuclidean): compare
///   `sum((xi - yi)^2) <= eps^2` (the unchanged default path — no `sqrt`).
/// - [`Manhattan(eps)`](MetricThreshold::Manhattan): compare `sum(|dx|) <= eps`.
/// - [`Minkowski { p, eps_pow_p }`](MetricThreshold::Minkowski): compare the
///   powered sum `sum(|dx|^p) <= eps^p` (no outer `^(1/p)`).
/// - [`Chebyshev(eps)`](MetricThreshold::Chebyshev): compare `max(|dx|) <= eps`.
#[derive(Debug, Clone, Copy)]
enum MetricThreshold<F> {
    /// Compare the sum-of-squares against `eps^2` (Euclidean / L2).
    SquaredEuclidean(F),
    /// Compare the sum of absolute differences against `eps` (Manhattan / L1).
    Manhattan(F),
    /// Compare the powered Minkowski sum `sum(|dx|^p)` against `eps^p`.
    Minkowski {
        /// The Minkowski order `p` (strictly positive).
        p: F,
        /// `eps^p`, precomputed so `region_query` does not take the outer root.
        eps_pow_p: F,
    },
    /// Compare the max absolute difference against `eps` (Chebyshev / L_inf).
    Chebyshev(F),
}

/// Find all neighbors within `eps` distance of point `idx` under the given
/// metric threshold.
///
/// A point `j` is a neighbor iff `dist(idx, j) <= eps`, mirroring
/// `radius_neighbors` (`sklearn/cluster/_dbscan.py:411-422`). Each branch
/// compares the metric's powered form against the matching powered `eps`
/// (e.g. `sum_sq <= eps^2`, `sum(|dx|^p) <= eps^p`) so the default Euclidean
/// path is byte-for-byte unchanged and no extra root rounding is introduced.
fn region_query<F: Float>(x: &Array2<F>, idx: usize, threshold: MetricThreshold<F>) -> Vec<usize> {
    let n_samples = x.nrows();
    let row = x.row(idx);
    let row_slice = row.as_slice().unwrap_or(&[]);

    let mut neighbors = Vec::new();
    for j in 0..n_samples {
        let other = x.row(j);
        let other_slice = other.as_slice().unwrap_or(&[]);
        let included = match threshold {
            MetricThreshold::SquaredEuclidean(eps_sq) => {
                squared_euclidean(row_slice, other_slice) <= eps_sq
            }
            MetricThreshold::Manhattan(eps) => manhattan(row_slice, other_slice) <= eps,
            MetricThreshold::Minkowski { p, eps_pow_p } => {
                minkowski_powered_sum(row_slice, other_slice, p) <= eps_pow_p
            }
            MetricThreshold::Chebyshev(eps) => chebyshev(row_slice, other_slice) <= eps,
        };
        if included {
            neighbors.push(j);
        }
    }
    neighbors
}

impl<F: Float + Send + Sync + 'static> Fit<Array2<F>, ()> for DBSCAN<F> {
    type Fitted = FittedDBSCAN<F>;
    type Error = FerroError;

    /// Fit the DBSCAN model to the data.
    ///
    /// # Errors
    ///
    /// Returns [`FerroError::InvalidParameter`] if `eps` is not positive
    /// or `min_samples` is zero.
    fn fit(&self, x: &Array2<F>, _y: &()) -> Result<FittedDBSCAN<F>, FerroError> {
        // Validate parameters.
        if self.eps <= F::zero() {
            return Err(FerroError::InvalidParameter {
                name: "eps".into(),
                reason: "must be positive".into(),
            });
        }

        if self.min_samples == 0 {
            return Err(FerroError::InvalidParameter {
                name: "min_samples".into(),
                reason: "must be at least 1".into(),
            });
        }

        // Reject non-finite X up front (NaN AND Inf), mirroring sklearn's
        // `_validate_data(X, accept_sparse="csr")` (`force_all_finite=True`
        // default) reached from `DBSCAN.fit` (`sklearn/cluster/_dbscan.py:395`),
        // which raises `ValueError("Input X contains NaN.")`
        // (`sklearn/utils/validation.py:147-154`). Never panics (R-CODE-2).
        if x.iter().any(|v| !v.is_finite()) {
            return Err(FerroError::InvalidParameter {
                name: "X".into(),
                reason: "Input X contains NaN or infinity.".into(),
            });
        }

        let n_samples = x.nrows();
        let n_features = x.ncols();

        // Validate `sample_weight` length against `n_samples`, mirroring
        // `_check_sample_weight` (`sklearn/utils/validation.py:2055-2060`):
        // a wrong-length weight vector raises `ValueError`. Negative/zero
        // weights are ALLOWED (DBSCAN does not pass `only_non_negative`).
        if let Some(w) = &self.sample_weight
            && w.len() != n_samples
        {
            return Err(FerroError::ShapeMismatch {
                expected: vec![n_samples],
                actual: vec![w.len()],
                context: "sample_weight".into(),
            });
        }

        if n_samples == 0 {
            return Ok(FittedDBSCAN {
                labels_: Array1::zeros(0),
                core_sample_indices_: Vec::new(),
                components_: Array2::zeros((0, n_features)),
                _marker: std::marker::PhantomData,
            });
        }

        // Resolve the metric into a powered threshold (mirroring sklearn's
        // `NearestNeighbors(metric=..., p=...).radius_neighbors` neighbor test
        // `dist <= eps`, `_dbscan.py:411-422`). Each branch compares a powered
        // distance form against the matching powered `eps` so the default
        // Euclidean path is byte-for-byte the prior `sum_sq <= eps*eps` and no
        // outer root rounding is introduced. `Minkowski(p)` validates `p > 0`
        // (sklearn's `NearestNeighbors` requires `p` in `(0, inf]`; `p=1`
        // collapses to Manhattan, `p=2` to the exact Euclidean sum-of-squares
        // form for byte-for-byte parity with the default).
        let threshold = match self.metric {
            DbscanMetric::Euclidean => MetricThreshold::SquaredEuclidean(self.eps * self.eps),
            DbscanMetric::Manhattan => MetricThreshold::Manhattan(self.eps),
            DbscanMetric::Chebyshev => MetricThreshold::Chebyshev(self.eps),
            DbscanMetric::Minkowski(p) => {
                // Reject any `p` that is not strictly positive (including NaN,
                // which fails every ordered comparison). sklearn's
                // `NearestNeighbors` requires `p` in `(0, inf]`.
                if p <= F::zero() || p.is_nan() {
                    return Err(FerroError::InvalidParameter {
                        name: "p".into(),
                        reason: "must be a positive float (Minkowski order)".into(),
                    });
                }
                if p == F::one() {
                    MetricThreshold::Manhattan(self.eps)
                } else if p == F::from(2).unwrap_or_else(F::nan) {
                    MetricThreshold::SquaredEuclidean(self.eps * self.eps)
                } else {
                    MetricThreshold::Minkowski {
                        p,
                        eps_pow_p: self.eps.powf(p),
                    }
                }
            }
        };

        // Step 1: Find neighborhoods for all points.
        let neighborhoods: Vec<Vec<usize>> = (0..n_samples)
            .map(|i| region_query(x, i, threshold))
            .collect();

        // Step 2: Identify core points.
        //
        // Unweighted (`sample_weight` is `None`): `is_core[i]` iff the
        // neighborhood size (self included) is `>= min_samples`, mirroring
        // `n_neighbors = [len(neighbors) ...]` (`_dbscan.py:425`).
        //
        // Weighted: `is_core[i]` iff the SUM of weights over the SAME
        // neighborhood (self included) is `>= min_samples` (a float compare),
        // mirroring `n_neighbors = [sum(sample_weight[neighbors]) ...]` and
        // `core_samples = n_neighbors >= self.min_samples`
        // (`_dbscan.py:427-435`). `min_samples` is cast to `F` for the compare.
        let is_core: Vec<bool> = match &self.sample_weight {
            None => neighborhoods
                .iter()
                .map(|n| n.len() >= self.min_samples)
                .collect(),
            Some(w) => {
                let min_samples_f = F::from(self.min_samples).unwrap_or_else(F::infinity);
                neighborhoods
                    .iter()
                    .map(|n| {
                        // sklearn computes `np.sum(sample_weight[neighbors])`
                        // (`_dbscan.py:428`); NumPy `np.sum` uses PAIRWISE
                        // summation, not a sequential fold, so a running sum that
                        // straddles `min_samples` (e.g. ten `0.1` weights ->
                        // numpy 1.0 vs sequential 0.999…) must use the SAME
                        // reduction order to match the float `>=` core boundary
                        // (#2189).
                        let neighbor_weights: Vec<F> = n.iter().map(|&j| w[j]).collect();
                        let weight_sum = numpy_pairwise_sum(&neighbor_weights);
                        weight_sum >= min_samples_f
                    })
                    .collect()
            }
        };

        let core_sample_indices: Vec<usize> = (0..n_samples).filter(|&i| is_core[i]).collect();

        // Retain the coordinates of the core samples, mirroring sklearn's
        // `self.components_ = X[self.core_sample_indices_].copy()`
        // (`sklearn/cluster/_dbscan.py:441-446`).
        let mut components_ = Array2::<F>::zeros((core_sample_indices.len(), n_features));
        for (k, &idx) in core_sample_indices.iter().enumerate() {
            components_.row_mut(k).assign(&x.row(idx));
        }

        // Step 3: Expand clusters from core points via BFS.
        let mut labels = Array1::from_elem(n_samples, -1isize);
        let mut current_cluster: isize = -1;

        for i in 0..n_samples {
            // Skip non-core or already-labeled points.
            if !is_core[i] || labels[i] != -1 {
                continue;
            }

            // Start a new cluster.
            current_cluster += 1;
            labels[i] = current_cluster;

            // BFS expansion.
            let mut queue: VecDeque<usize> = VecDeque::new();
            for &neighbor in &neighborhoods[i] {
                if neighbor != i {
                    queue.push_back(neighbor);
                }
            }

            while let Some(q) = queue.pop_front() {
                if labels[q] == -1 {
                    // Assign to current cluster (was noise or unvisited).
                    labels[q] = current_cluster;
                }

                if !is_core[q] {
                    // Border point — don't expand further.
                    continue;
                }

                // If this core point was already assigned to this cluster
                // by a prior BFS step, skip expanding again.
                if labels[q] == current_cluster {
                    // Expand: add unvisited neighbors to queue.
                    for &neighbor in &neighborhoods[q] {
                        if labels[neighbor] == -1 {
                            labels[neighbor] = current_cluster;
                            if is_core[neighbor] {
                                queue.push_back(neighbor);
                            }
                        }
                    }
                }
            }
        }

        Ok(FittedDBSCAN {
            labels_: labels,
            core_sample_indices_: core_sample_indices,
            components_,
            _marker: std::marker::PhantomData,
        })
    }
}

impl<F: Float + Send + Sync + 'static> DBSCAN<F> {
    /// Fit on `x` and return the cluster labels for those samples in one
    /// call. Equivalent to sklearn `ClusterMixin.fit_predict`. Noise
    /// samples are labeled as `-1`.
    ///
    /// # Errors
    ///
    /// Forwards any error from [`Fit::fit`].
    pub fn fit_predict(&self, x: &Array2<F>) -> Result<Array1<isize>, FerroError> {
        let fitted = self.fit(x, &())?;
        Ok(fitted.labels().clone())
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    /// Two well-separated clusters.
    fn make_two_clusters() -> Array2<f64> {
        Array2::from_shape_vec(
            (8, 2),
            vec![
                // Cluster A near (0, 0)
                0.0, 0.0, 0.5, 0.0, 0.0, 0.5, 0.5, 0.5, // Cluster B near (10, 10)
                10.0, 10.0, 10.5, 10.0, 10.0, 10.5, 10.5, 10.5,
            ],
        )
        .unwrap()
    }

    #[test]
    fn test_simple_clusters() {
        let x = make_two_clusters();
        let model = DBSCAN::<f64>::new(1.5).with_min_samples(2);
        let fitted = model.fit(&x, &()).unwrap();

        let labels = fitted.labels();
        assert_eq!(labels.len(), 8);

        // First 4 should be in one cluster, last 4 in another.
        assert_eq!(labels[0], labels[1]);
        assert_eq!(labels[0], labels[2]);
        assert_eq!(labels[0], labels[3]);
        assert_eq!(labels[4], labels[5]);
        assert_eq!(labels[4], labels[6]);
        assert_eq!(labels[4], labels[7]);
        assert_ne!(labels[0], labels[4]);

        // Two clusters found.
        assert_eq!(fitted.n_clusters(), 2);
    }

    #[test]
    fn test_noise_detection() {
        // Two tight clusters + one outlier.
        let x = Array2::from_shape_vec(
            (7, 2),
            vec![
                0.0, 0.0, 0.1, 0.0, 0.0, 0.1, 5.0, 5.0, 5.1, 5.0, 5.0, 5.1, // Outlier
                100.0, 100.0,
            ],
        )
        .unwrap();

        let model = DBSCAN::<f64>::new(0.5).with_min_samples(2);
        let fitted = model.fit(&x, &()).unwrap();

        let labels = fitted.labels();
        // Outlier should be noise.
        assert_eq!(labels[6], -1);
        // Others should not be noise.
        assert!(labels[0] >= 0);
        assert!(labels[3] >= 0);
    }

    #[test]
    fn test_core_border_noise_identification() {
        // Ring of points: 3 core, 1 border, 1 noise.
        let x = Array2::from_shape_vec(
            (5, 2),
            vec![
                0.0, 0.0, // core (neighbors: 0,1,2)
                0.3, 0.0, // core (neighbors: 0,1,2)
                0.0, 0.3, // core (neighbors: 0,1,2)
                0.6, 0.0, // border (neighbor: 1 at least)
                10.0, 10.0, // noise
            ],
        )
        .unwrap();

        let model = DBSCAN::<f64>::new(0.5).with_min_samples(3);
        let fitted = model.fit(&x, &()).unwrap();

        let labels = fitted.labels();
        let core_indices = fitted.core_sample_indices();

        // Points 0, 1, 2 should be core points.
        assert!(core_indices.contains(&0));
        assert!(core_indices.contains(&1));
        assert!(core_indices.contains(&2));

        // Point 3 is a border point (reachable from core point 1).
        assert!(labels[3] >= 0);
        assert!(!core_indices.contains(&3));

        // Point 4 is noise.
        assert_eq!(labels[4], -1);
    }

    #[test]
    fn test_all_noise_with_high_min_samples() {
        let x =
            Array2::from_shape_vec((4, 2), vec![0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0]).unwrap();

        // min_samples too high: no point has enough neighbors.
        let model = DBSCAN::<f64>::new(0.5).with_min_samples(100);
        let fitted = model.fit(&x, &()).unwrap();

        let labels = fitted.labels();
        for &label in labels {
            assert_eq!(label, -1);
        }
        assert_eq!(fitted.n_clusters(), 0);
    }

    #[test]
    fn test_all_noise_with_tiny_eps() {
        let x =
            Array2::from_shape_vec((4, 2), vec![0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0]).unwrap();

        // eps too small: no points are neighbors of each other (except self).
        let model = DBSCAN::<f64>::new(0.001).with_min_samples(2);
        let fitted = model.fit(&x, &()).unwrap();

        for &label in fitted.labels() {
            assert_eq!(label, -1);
        }
    }

    #[test]
    fn test_single_point_cluster() {
        // One cluster + one isolated point.
        let x = Array2::from_shape_vec((4, 2), vec![0.0, 0.0, 0.1, 0.0, 0.0, 0.1, 100.0, 100.0])
            .unwrap();

        let model = DBSCAN::<f64>::new(0.5).with_min_samples(2);
        let fitted = model.fit(&x, &()).unwrap();

        // Isolated point should be noise.
        assert_eq!(fitted.labels()[3], -1);
        // Others should be in a cluster.
        assert!(fitted.labels()[0] >= 0);
    }

    #[test]
    fn test_all_in_one_cluster() {
        let x =
            Array2::from_shape_vec((4, 2), vec![0.0, 0.0, 0.1, 0.0, 0.0, 0.1, 0.1, 0.1]).unwrap();

        let model = DBSCAN::<f64>::new(1.0).with_min_samples(2);
        let fitted = model.fit(&x, &()).unwrap();

        // All should be in the same cluster.
        let labels = fitted.labels();
        assert!(labels[0] >= 0);
        assert_eq!(labels[0], labels[1]);
        assert_eq!(labels[0], labels[2]);
        assert_eq!(labels[0], labels[3]);
        assert_eq!(fitted.n_clusters(), 1);
    }

    #[test]
    fn test_empty_data() {
        let x = Array2::<f64>::zeros((0, 2));
        let model = DBSCAN::<f64>::new(1.0);
        let fitted = model.fit(&x, &()).unwrap();

        assert_eq!(fitted.labels().len(), 0);
        assert!(fitted.core_sample_indices().is_empty());
        assert_eq!(fitted.n_clusters(), 0);
        // components_ is (0, n_features) on empty input, matching sklearn's
        // `np.empty((0, X.shape[1]))` (_dbscan.py:446).
        assert_eq!(fitted.components().shape(), &[0, 2]);
    }

    #[test]
    fn test_single_sample() {
        let x = Array2::from_shape_vec((1, 2), vec![5.0, 5.0]).unwrap();
        let model = DBSCAN::<f64>::new(1.0).with_min_samples(1);
        let fitted = model.fit(&x, &()).unwrap();

        // With min_samples=1, a single point is a core point and its own cluster.
        assert_eq!(fitted.labels()[0], 0);
        assert_eq!(fitted.core_sample_indices(), &[0]);
    }

    #[test]
    fn test_single_sample_noise() {
        let x = Array2::from_shape_vec((1, 2), vec![5.0, 5.0]).unwrap();
        let model = DBSCAN::<f64>::new(1.0).with_min_samples(2);
        let fitted = model.fit(&x, &()).unwrap();

        // With min_samples=2, a single point cannot be a core point.
        assert_eq!(fitted.labels()[0], -1);
    }

    #[test]
    fn test_invalid_eps() {
        let x = Array2::from_shape_vec((3, 2), vec![1.0, 1.0, 2.0, 2.0, 3.0, 3.0]).unwrap();
        let model = DBSCAN::<f64>::new(-1.0);
        let result = model.fit(&x, &());
        assert!(result.is_err());
    }

    #[test]
    fn test_zero_eps() {
        let x = Array2::from_shape_vec((3, 2), vec![1.0, 1.0, 2.0, 2.0, 3.0, 3.0]).unwrap();
        let model = DBSCAN::<f64>::new(0.0);
        let result = model.fit(&x, &());
        assert!(result.is_err());
    }

    #[test]
    fn test_zero_min_samples() {
        let x = Array2::from_shape_vec((3, 2), vec![1.0, 1.0, 2.0, 2.0, 3.0, 3.0]).unwrap();
        let model = DBSCAN::<f64>::new(1.0).with_min_samples(0);
        let result = model.fit(&x, &());
        assert!(result.is_err());
    }

    #[test]
    fn test_core_sample_indices_correct() {
        let x = Array2::from_shape_vec(
            (5, 2),
            vec![0.0, 0.0, 0.1, 0.0, 0.0, 0.1, 0.1, 0.1, 100.0, 100.0],
        )
        .unwrap();

        let model = DBSCAN::<f64>::new(0.5).with_min_samples(3);
        let fitted = model.fit(&x, &()).unwrap();

        // All core samples should have enough neighbors.
        for &idx in fitted.core_sample_indices() {
            assert!(idx < 5);
        }
    }

    #[test]
    fn test_f32_support() {
        let x = Array2::from_shape_vec(
            (6, 2),
            vec![
                0.0f32, 0.0, 0.1, 0.0, 0.0, 0.1, 10.0, 10.0, 10.1, 10.0, 10.0, 10.1,
            ],
        )
        .unwrap();

        let model = DBSCAN::<f32>::new(0.5).with_min_samples(2);
        let fitted = model.fit(&x, &()).unwrap();

        assert_eq!(fitted.labels().len(), 6);
        assert_eq!(fitted.n_clusters(), 2);
    }

    #[test]
    fn test_three_clusters() {
        let x = Array2::from_shape_vec(
            (9, 2),
            vec![
                0.0, 0.0, 0.1, 0.0, 0.0, 0.1, 5.0, 5.0, 5.1, 5.0, 5.0, 5.1, 10.0, 0.0, 10.1, 0.0,
                10.0, 0.1,
            ],
        )
        .unwrap();

        let model = DBSCAN::<f64>::new(0.5).with_min_samples(2);
        let fitted = model.fit(&x, &()).unwrap();

        assert_eq!(fitted.n_clusters(), 3);

        let labels = fitted.labels();
        // Same cluster groupings.
        assert_eq!(labels[0], labels[1]);
        assert_eq!(labels[0], labels[2]);
        assert_eq!(labels[3], labels[4]);
        assert_eq!(labels[3], labels[5]);
        assert_eq!(labels[6], labels[7]);
        assert_eq!(labels[6], labels[8]);
        // Different clusters.
        assert_ne!(labels[0], labels[3]);
        assert_ne!(labels[0], labels[6]);
        assert_ne!(labels[3], labels[6]);
    }

    #[test]
    fn dbscan_components_match_sklearn() {
        // Live sklearn 1.5.2 oracle (run from /tmp):
        //   X = [[1,1],[1.2,1.1],[0.9,1.0],[8,8],[8.1,8.2],[8.0,7.9],[5,5]]
        //   DBSCAN(eps=0.5, min_samples=3).fit(X)
        //   labels_              = [0,0,0,1,1,1,-1]
        //   core_sample_indices_ = [0,1,2,3,4,5]
        //   components_          = [[1,1],[1.2,1.1],[0.9,1.0],[8,8],[8.1,8.2],[8.0,7.9]]
        // Mirrors `self.components_ = X[self.core_sample_indices_].copy()`
        // (sklearn/cluster/_dbscan.py:441-446).
        let x = Array2::from_shape_vec(
            (7, 2),
            vec![
                1.0, 1.0, 1.2, 1.1, 0.9, 1.0, 8.0, 8.0, 8.1, 8.2, 8.0, 7.9, 5.0, 5.0,
            ],
        )
        .unwrap();

        let model = DBSCAN::<f64>::new(0.5).with_min_samples(3);
        let fitted = model.fit(&x, &()).unwrap();

        // core_sample_indices_ oracle.
        assert_eq!(fitted.core_sample_indices(), &[0, 1, 2, 3, 4, 5]);

        let oracle = Array2::from_shape_vec(
            (6, 2),
            vec![1.0, 1.0, 1.2, 1.1, 0.9, 1.0, 8.0, 8.0, 8.1, 8.2, 8.0, 7.9],
        )
        .unwrap();

        let components = fitted.components();
        assert_eq!(components.shape(), &[6, 2]);

        // Element-wise match against the oracle matrix (<= 1e-12).
        for (got, want) in components.iter().zip(oracle.iter()) {
            assert!(
                (got - want).abs() <= 1e-12,
                "components_ mismatch: got {got}, want {want}"
            );
        }

        // Invariant: each row of components_ equals X[core_sample_indices()[k]].
        for (k, &idx) in fitted.core_sample_indices().iter().enumerate() {
            for j in 0..x.ncols() {
                assert!(
                    (components[[k, j]] - x[[idx, j]]).abs() <= 1e-12,
                    "row {k} != X[{idx}] at col {j}"
                );
            }
        }
    }

    #[test]
    fn test_identical_points() {
        let x =
            Array2::from_shape_vec((4, 2), vec![1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]).unwrap();

        let model = DBSCAN::<f64>::new(0.5).with_min_samples(2);
        let fitted = model.fit(&x, &()).unwrap();

        // All identical points should be in the same cluster.
        let labels = fitted.labels();
        assert!(labels[0] >= 0);
        for &label in labels {
            assert_eq!(label, labels[0]);
        }
    }
}