ferrolearn-cluster 0.5.0

//! HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise).
//!
//! This module provides [`Hdbscan`], a hierarchical extension of DBSCAN that
//! automatically discovers clusters of varying densities. Unlike DBSCAN, which
//! uses a single `eps` threshold, HDBSCAN builds a hierarchy of clusterings
//! and selects the most persistent clusters.
//!
//! # Algorithm
//!
//! 1. Compute the **core distance** for each point (distance to the
//!    `min_samples`-th nearest neighbor).
//! 2. Build the **mutual reachability graph**: for each pair `(i, j)`, the
//!    mutual reachability distance is
//!    `max(core_dist[i], core_dist[j], d(i, j))`.
//! 3. Compute the **minimum spanning tree** (MST) of this graph using Prim's
//!    algorithm.
//! 4. Sort MST edges by weight and build a **condensed cluster tree**.
//! 5. Extract flat clusters from the condensed tree by maximizing cluster
//!    stability.
//!
//! # Notes
//!
//! HDBSCAN does **not** implement [`Predict`](ferrolearn_core::Predict) — it
//! only labels the training data. Use the fitted labels directly.
//!
//! # Examples
//!
//! ```
//! use ferrolearn_cluster::Hdbscan;
//! use ferrolearn_core::Fit;
//! use ndarray::Array2;
//!
//! let x = Array2::from_shape_vec((8, 2), vec![
//!     0.0, 0.0,  0.1, 0.0,  0.0, 0.1,  0.1, 0.1,
//!     5.0, 5.0,  5.1, 5.0,  5.0, 5.1,  5.1, 5.1,
//! ]).unwrap();
//!
//! let model = Hdbscan::<f64>::new();
//! let fitted = model.with_min_cluster_size(3).fit(&x, &()).unwrap();
//! let labels = fitted.labels();
//! assert_eq!(labels.len(), 8);
//! ```
//!
//! # `## REQ status`
//!
//! Binary (R-DEFER-2), translating `sklearn/cluster/_hdbscan/hdbscan.py`
//! (`class HDBSCAN`). Design doc: `.design/cluster/hdbscan.md`. Cites use ferrolearn
//! symbol anchors / sklearn `file:line` (commit 156ef14); expected values from the
//! live sklearn 1.5.2 oracle (R-CHAR-3). HDBSCAN is deterministic (no RNG), so the
//! partition value-parity is genuine. No PyO3 binding — only consumer is the crate
//! re-export. ferrolearn implements the correct algorithm SHAPE (core-dist →
//! mutual-reachability → MST → condensed tree → EOM stability), but the condensed-tree
//! stability/probability formulas and the parameter/attribute surface diverge.
//!
//! | REQ | Status | Evidence |
//! |---|---|---|
//! | REQ-1 (`labels_` PARTITION + noise set matches sklearn, well-separated data) | SHIPPED | impl `Fit::fit` (`compute_core_distances` → `build_mst` Prim → `extract_clusters` single-linkage + EOM) produces the SAME co-membership partition + `-1` noise set as sklearn `HDBSCAN` on well-separated/dense blobs (deterministic — genuine value-parity, not just up-to-permutation, on these fixtures). Consumer: crate re-export `pub use hdbscan::{FittedHdbscan, Hdbscan}` (`lib.rs`). Guards: `req1_partition_two_blobs_matches_sklearn`, `req1_partition_outliers_matches_sklearn` in `tests/divergence_hdbscan.rs` (live-oracle, robust to the REQ-5 fix). Underclaim: the absolute label INTEGERS (REQ-6) and behavior on borderline-density data (gated on the exact condensed-tree algorithm) may differ. |
//! | REQ-2 (`probabilities_` range + noise-prob-0 contract) | SHIPPED | impl `extract_clusters` yields `probabilities_` ∈ [0,1] with noise points at 0. Consumer: crate re-export. Guard: `req2_probabilities_range_and_noise_zero`. Underclaim: CONTRACT (range + noise-0) only — the exact GLOSH probability VALUES diverge (REQ-4 #1069). |
//! | REQ-3 (defaults) | SHIPPED | impl `fn new` defaults `min_cluster_size=5`, `min_samples=None`, `cluster_selection_epsilon=0.0` = sklearn `HDBSCAN` defaults (`hdbscan.py:674-676`); `min_cluster_size>=2` validation matches sklearn `Interval(Integral,2,None)`. Guard: `req3_defaults_match_sklearn`. |
//! | REQ-5 (core-distance index matches sklearn) | SHIPPED | impl `fn compute_core_distances` now takes `sorted_dists[min_samples-1]` (the `min_samples`-NN query includes self at index 0) = sklearn `core_distances = neighbors_distances[:, -1]` (`hdbscan.py:351-352`). Was `dists[min_samples]` (one neighbor too far → ferrolearn ms=k reproduced sklearn ms=k+1). Guard: `divergence_core_distance_off_by_one_partition` (min_samples-sensitive fixture: ferrolearn ms=3 now == sklearn ms=3). Fixed #1070. |
//! | REQ-4 (`probabilities_` GLOSH VALUES) | NOT-STARTED | open prereq blocker #1069. sklearn probability = `lambda_p / max_lambda_in_cluster` (GLOSH, `_hdbscan/_tree.pyx`); ferrolearn `extract_clusters` uses `(child_lambda - birth)/(death - birth)`. Different formula → values diverge (two_blobs: sklearn `[1,0.707,...]` vs ferrolearn all `1.0`). |
//! | REQ-6 (`labels_` integer numbering) | NOT-STARTED | open prereq blocker #1071. sklearn numbers clusters by its condensed-tree/EOM traversal; ferrolearn by `label_counter` selection order (`extract_clusters`). Partition matches up-to-permutation; absolute integers diverge. |
//! | REQ-7 (`cluster_selection_method` eom/leaf + allow_single_cluster + max_cluster_size) | NOT-STARTED | open prereq blocker #1072. sklearn `cluster_selection_method` default `'eom'` + `'leaf'`, `allow_single_cluster`, `max_cluster_size`; ferrolearn EOM-only, none of these params. |
//! | REQ-8 (`metric`/`alpha`/`algorithm`/`leaf_size`/`n_jobs`) | NOT-STARTED | open prereq blocker #1073. sklearn `metric` (many), `alpha` (1.0), `algorithm` (auto/brute/kd_tree/ball_tree), `leaf_size`, `n_jobs`; ferrolearn euclidean-only Prim O(n²), none. |
//! | REQ-9 (PyO3 binding) | NOT-STARTED | open prereq blocker #1074. No `_RsHDBSCAN` (grep empty); `import ferrolearn` cannot reach HDBSCAN. |
//! | REQ-10 (`centroids_`/`medoids_`/`n_features_in_` + `store_centers`) | NOT-STARTED | open prereq blocker #1075. sklearn `store_centers` → `centroids_`/`medoids_`, plus `n_features_in_`; `FittedHdbscan` exposes `labels_`/`probabilities_` only (+ `n_clusters()` helper sklearn lacks). |
//! | REQ-11 (error ABI + n_samples==1) | NOT-STARTED | open prereq blocker #1076. sklearn `InvalidParameterError` vs ferrolearn `FerroError`; single-point/edge handling vs sklearn. |
//! | REQ-12 (`cluster_selection_epsilon` exact semantics) | NOT-STARTED | open prereq blocker #1077. sklearn DBSCAN-like flat-cut epsilon in `_tree.pyx`; ferrolearn `extract_clusters` `above_epsilon = split_dist > eps` is a simplified approximation. |
//! | REQ-13 (ferray substrate) | NOT-STARTED | open prereq blocker #1078. `hdbscan.rs` imports `ndarray`/`num-traits`, not `ferray-core` (R-SUBSTRATE-1/2). |

use ferrolearn_core::error::FerroError;
use ferrolearn_core::traits::Fit;
use ndarray::{Array1, Array2};
use num_traits::Float;

/// HDBSCAN clustering configuration (unfitted).
///
/// Holds hyperparameters for the HDBSCAN algorithm. Call [`Fit::fit`]
/// to run the algorithm and produce a [`FittedHdbscan`].
///
/// # Type Parameters
///
/// - `F`: The floating-point type (`f32` or `f64`).
#[derive(Debug, Clone)]
pub struct Hdbscan<F> {
    /// The minimum number of points required to form a cluster.
    pub min_cluster_size: usize,
    /// The number of neighbors used to compute core distances.
    /// Defaults to `min_cluster_size` if `None`.
    pub min_samples: Option<usize>,
    /// A distance threshold. Clusters below this epsilon are not split.
    /// Default is `0.0` (no epsilon constraint).
    pub cluster_selection_epsilon: F,
}

impl<F: Float> Hdbscan<F> {
    /// Create a new `Hdbscan` with default parameters.
    ///
    /// Defaults: `min_cluster_size = 5`, `min_samples = None`,
    /// `cluster_selection_epsilon = 0.0`.
    #[must_use]
    pub fn new() -> Self {
        Self {
            min_cluster_size: 5,
            min_samples: None,
            cluster_selection_epsilon: F::zero(),
        }
    }

    /// Set the minimum cluster size.
    #[must_use]
    pub fn with_min_cluster_size(mut self, min_cluster_size: usize) -> Self {
        self.min_cluster_size = min_cluster_size;
        self
    }

    /// Set the number of samples for core distance computation.
    #[must_use]
    pub fn with_min_samples(mut self, min_samples: usize) -> Self {
        self.min_samples = Some(min_samples);
        self
    }

    /// Set the cluster selection epsilon.
    #[must_use]
    pub fn with_cluster_selection_epsilon(mut self, eps: F) -> Self {
        self.cluster_selection_epsilon = eps;
        self
    }
}

impl<F: Float> Default for Hdbscan<F> {
    fn default() -> Self {
        Self::new()
    }
}

/// Fitted HDBSCAN model.
///
/// Stores the cluster labels and cluster membership probabilities from the
/// training run. Noise points are labeled with `-1`.
///
/// HDBSCAN does **not** implement [`Predict`](ferrolearn_core::Predict).
#[derive(Debug, Clone)]
pub struct FittedHdbscan<F> {
    /// Cluster label for each training sample. Noise points have label -1.
    labels_: Array1<isize>,
    /// Cluster membership probabilities for each training sample.
    /// Values range from 0.0 (noise) to 1.0 (strong membership).
    probabilities_: Array1<F>,
}

impl<F: Float> FittedHdbscan<F> {
    /// Return the cluster labels for the training data.
    ///
    /// Noise points have label `-1`.
    #[must_use]
    pub fn labels(&self) -> &Array1<isize> {
        &self.labels_
    }

    /// Return the cluster membership probabilities.
    ///
    /// Values range from `0.0` (noise) to `1.0` (strong cluster membership).
    #[must_use]
    pub fn probabilities(&self) -> &Array1<F> {
        &self.probabilities_
    }

    /// Return the number of clusters found (excluding noise).
    #[must_use]
    pub fn n_clusters(&self) -> usize {
        let max_label = self.labels_.iter().max().copied().unwrap_or(-1);
        if max_label < 0 {
            0
        } else {
            (max_label + 1) as usize
        }
    }
}

// ─────────────────────────────────────────────────────────────────────────────
// Internal helpers
// ─────────────────────────────────────────────────────────────────────────────

/// Compute the squared Euclidean distance between two slices.
#[inline]
fn sq_euclidean<F: Float>(a: &[F], b: &[F]) -> F {
    a.iter()
        .zip(b.iter())
        .fold(F::zero(), |acc, (&ai, &bi)| acc + (ai - bi) * (ai - bi))
}

/// Compute the core distance for each point: the distance to its
/// `min_samples`-th nearest neighbor, where the `min_samples`-nearest-neighbors
/// query INCLUDES the point itself (self at sorted index 0, distance 0). This is
/// therefore `sorted_dists[min_samples - 1]`, matching sklearn
/// (`sklearn/cluster/_hdbscan/hdbscan.py:351-352`,
/// `core_distances = neighbors_distances[:, -1]` of the `min_samples` NN query).
fn compute_core_distances<F: Float>(x: &Array2<F>, min_samples: usize) -> Vec<F> {
    let n = x.nrows();
    let mut core_dists = vec![F::zero(); n];

    for (i, cd) in core_dists.iter_mut().enumerate() {
        let row_i = x.row(i);
        let si = row_i.as_slice().unwrap_or(&[]);
        let mut dists: Vec<F> = (0..n)
            .map(|j| {
                if i == j {
                    F::zero()
                } else {
                    sq_euclidean(si, x.row(j).as_slice().unwrap_or(&[])).sqrt()
                }
            })
            .collect();
        dists.sort_by(|a, b| a.partial_cmp(b).unwrap_or(std::cmp::Ordering::Equal));
        // sklearn core distance = sorted_dists[min_samples-1] (the min_samples-nearest-
        // neighbors query includes self at index 0), sklearn/cluster/_hdbscan/hdbscan.py:351-352.
        // `fn fit` validates min_samples >= 1, so this never underflows; saturating_sub
        // is defensive and clamped to the last valid index.
        let k = min_samples.saturating_sub(1).min(n - 1);
        *cd = dists[k];
    }
    core_dists
}

/// An MST edge.
#[derive(Debug, Clone, Copy)]
struct MstEdge<F> {
    /// Source node.
    u: usize,
    /// Destination node.
    v: usize,
    /// Weight (mutual reachability distance).
    weight: F,
}

/// Build the minimum spanning tree of the mutual reachability graph using
/// Prim's algorithm.
fn build_mst<F: Float>(x: &Array2<F>, core_dists: &[F]) -> Vec<MstEdge<F>> {
    let n = x.nrows();
    if n <= 1 {
        return Vec::new();
    }

    let mut in_tree = vec![false; n];
    let mut min_weight = vec![F::max_value(); n];
    let mut min_source = vec![0usize; n];
    let mut edges = Vec::with_capacity(n - 1);

    // Start from node 0.
    in_tree[0] = true;
    for j in 1..n {
        let d = mutual_reachability(x, core_dists, 0, j);
        min_weight[j] = d;
        min_source[j] = 0;
    }

    for _ in 0..(n - 1) {
        // Find the node not yet in the tree with the minimum edge weight.
        let mut best_node = 0;
        let mut best_weight = F::max_value();
        for j in 0..n {
            if !in_tree[j] && min_weight[j] < best_weight {
                best_weight = min_weight[j];
                best_node = j;
            }
        }

        in_tree[best_node] = true;
        edges.push(MstEdge {
            u: min_source[best_node],
            v: best_node,
            weight: best_weight,
        });

        // Update min_weight for remaining nodes.
        for j in 0..n {
            if !in_tree[j] {
                let d = mutual_reachability(x, core_dists, best_node, j);
                if d < min_weight[j] {
                    min_weight[j] = d;
                    min_source[j] = best_node;
                }
            }
        }
    }

    edges
}

/// Compute the mutual reachability distance between points i and j.
#[inline]
fn mutual_reachability<F: Float>(x: &Array2<F>, core_dists: &[F], i: usize, j: usize) -> F {
    let d = sq_euclidean(
        x.row(i).as_slice().unwrap_or(&[]),
        x.row(j).as_slice().unwrap_or(&[]),
    )
    .sqrt();
    // max(core_dist[i], core_dist[j], d(i, j))
    let mut result = d;
    if core_dists[i] > result {
        result = core_dists[i];
    }
    if core_dists[j] > result {
        result = core_dists[j];
    }
    result
}

/// A node in the condensed tree.
#[derive(Debug, Clone)]
struct CondensedNode {
    /// Child cluster IDs (could be leaf points represented as negative IDs).
    children: Vec<usize>,
    /// Lambda (1/distance) at which each child fell out.
    child_lambdas: Vec<f64>,
    /// Lambda at which this cluster was born (parent split).
    birth_lambda: f64,
    /// Lambda at which this cluster dies (further split).
    death_lambda: f64,
    /// The stability of this cluster.
    stability: f64,
    /// Number of points in this cluster.
    size: usize,
}

/// Build the condensed cluster tree from sorted MST edges, and extract
/// clusters. Returns (labels, probabilities).
fn extract_clusters<F: Float>(
    n_samples: usize,
    mst_edges: &mut [MstEdge<F>],
    min_cluster_size: usize,
    cluster_selection_epsilon: F,
) -> (Array1<isize>, Array1<F>) {
    if n_samples == 0 {
        return (Array1::zeros(0), Array1::zeros(0));
    }

    // Sort MST edges by weight (ascending).
    mst_edges.sort_by(|a, b| {
        a.weight
            .partial_cmp(&b.weight)
            .unwrap_or(std::cmp::Ordering::Equal)
    });

    // Union-Find for tracking component membership.
    let mut parent: Vec<usize> = (0..n_samples).collect();
    let mut size = vec![1usize; n_samples];

    fn find(parent: &mut [usize], i: usize) -> usize {
        let mut root = i;
        while parent[root] != root {
            root = parent[root];
        }
        // Path compression.
        let mut current = i;
        while parent[current] != root {
            let next = parent[current];
            parent[current] = root;
            current = next;
        }
        root
    }

    // Build the single-linkage dendrogram by processing edges in order.
    // We track a condensed tree: only splits where both children have
    // at least min_cluster_size points are considered real cluster splits.

    // For each point, track which condensed cluster it belongs to.
    let mut point_cluster = vec![0usize; n_samples]; // condensed cluster ID
    let mut condensed_clusters: Vec<CondensedNode> = Vec::new();

    // Start: all points are in one big cluster (root).
    condensed_clusters.push(CondensedNode {
        children: Vec::new(),
        child_lambdas: Vec::new(),
        birth_lambda: 0.0,
        death_lambda: 0.0,
        stability: 0.0,
        size: n_samples,
    });

    // Process edges from largest to smallest (we want to split, so reverse).
    // Actually, for building the hierarchy top-down, we process from largest
    // distance down. But the standard approach is:
    // 1. Build single-linkage tree bottom-up (process edges smallest to largest).
    // 2. Convert to condensed tree.
    // Let me do the standard approach: build the dendrogram bottom-up.

    // Dendrogram: each merge is (child_a, child_b, distance, new_size).
    let mut dendrogram: Vec<(usize, usize, f64, usize)> = Vec::with_capacity(n_samples - 1);
    let mut next_cluster_id = n_samples; // IDs >= n_samples are internal nodes.

    // Map from UF root to current dendrogram cluster ID.
    let mut uf_to_cluster: Vec<usize> = (0..n_samples).collect();

    for edge in mst_edges.iter() {
        let ru = find(&mut parent, edge.u);
        let rv = find(&mut parent, edge.v);
        if ru == rv {
            continue;
        }

        let su = size[ru];
        let sv = size[rv];
        let cu = uf_to_cluster[ru];
        let cv = uf_to_cluster[rv];
        let dist = edge.weight.to_f64().unwrap_or(0.0);

        dendrogram.push((cu, cv, dist, su + sv));

        // Merge: smaller into larger.
        if su < sv {
            parent[ru] = rv;
            size[rv] += su;
            uf_to_cluster[rv] = next_cluster_id;
        } else {
            parent[rv] = ru;
            size[ru] += sv;
            uf_to_cluster[ru] = next_cluster_id;
        }
        next_cluster_id += 1;
    }

    // Now build the condensed tree from the dendrogram.
    // Total nodes: n_samples (leaves) + dendrogram.len() (internal).
    let total_nodes = n_samples + dendrogram.len();

    // For each internal node, record its children and merge distance.
    let mut node_children: Vec<(usize, usize)> = vec![(0, 0); total_nodes];
    let mut node_distance: Vec<f64> = vec![0.0; total_nodes];
    let mut node_size: Vec<usize> = vec![1; total_nodes];

    for (i, &(left, right, dist, sz)) in dendrogram.iter().enumerate() {
        let node_id = n_samples + i;
        node_children[node_id] = (left, right);
        node_distance[node_id] = dist;
        node_size[node_id] = sz;
    }

    // Traverse the dendrogram top-down to build the condensed tree.
    // The root is the last internal node.
    let root_node = if dendrogram.is_empty() {
        0
    } else {
        n_samples + dendrogram.len() - 1
    };

    // Condensed tree: track which cluster each point ends up in.
    // Start fresh.
    condensed_clusters.clear();
    let mut cond_cluster_counter: usize = 0;

    // Map from dendrogram node to condensed cluster.
    let mut node_to_cond: Vec<Option<usize>> = vec![None; total_nodes];

    // BFS/DFS from root.
    let root_cond = cond_cluster_counter;
    condensed_clusters.push(CondensedNode {
        children: Vec::new(),
        child_lambdas: Vec::new(),
        birth_lambda: 0.0,
        death_lambda: 0.0,
        stability: 0.0,
        size: n_samples,
    });
    cond_cluster_counter += 1;
    node_to_cond[root_node] = Some(root_cond);

    // Stack for DFS: (dendrogram_node_id, parent_condensed_cluster_id).
    let mut stack: Vec<(usize, usize)> = vec![(root_node, root_cond)];

    let eps_f64 = cluster_selection_epsilon.to_f64().unwrap_or(0.0);

    while let Some((node_id, parent_cond)) = stack.pop() {
        if node_id < n_samples {
            // This is a leaf (individual point). Add it to the parent condensed cluster.
            condensed_clusters[parent_cond].children.push(node_id);
            let lambda = if condensed_clusters[parent_cond].death_lambda > 0.0 {
                condensed_clusters[parent_cond].death_lambda
            } else {
                condensed_clusters[parent_cond].birth_lambda
            };
            condensed_clusters[parent_cond].child_lambdas.push(lambda);
            point_cluster[node_id] = parent_cond;
            continue;
        }

        let (left, right) = node_children[node_id];
        let left_size = node_size.get(left).copied().unwrap_or(1);
        let right_size = node_size.get(right).copied().unwrap_or(1);
        let split_dist = node_distance[node_id];
        let lambda = if split_dist > 0.0 {
            1.0 / split_dist
        } else {
            f64::MAX
        };

        let both_large = left_size >= min_cluster_size && right_size >= min_cluster_size;
        let above_epsilon = split_dist > eps_f64;

        if both_large && above_epsilon {
            // True split: create two new condensed clusters.
            condensed_clusters[parent_cond].death_lambda = lambda;

            let left_cond = cond_cluster_counter;
            condensed_clusters.push(CondensedNode {
                children: Vec::new(),
                child_lambdas: Vec::new(),
                birth_lambda: lambda,
                death_lambda: 0.0,
                stability: 0.0,
                size: left_size,
            });
            cond_cluster_counter += 1;

            let right_cond = cond_cluster_counter;
            condensed_clusters.push(CondensedNode {
                children: Vec::new(),
                child_lambdas: Vec::new(),
                birth_lambda: lambda,
                death_lambda: 0.0,
                stability: 0.0,
                size: right_size,
            });
            cond_cluster_counter += 1;

            condensed_clusters[parent_cond]
                .children
                .push(left_cond + n_samples);
            condensed_clusters[parent_cond].child_lambdas.push(lambda);
            condensed_clusters[parent_cond]
                .children
                .push(right_cond + n_samples);
            condensed_clusters[parent_cond].child_lambdas.push(lambda);

            node_to_cond[left] = Some(left_cond);
            node_to_cond[right] = Some(right_cond);

            stack.push((left, left_cond));
            stack.push((right, right_cond));
        } else {
            // Not a true split. One or both sides are too small, or below epsilon.
            // Points from the small side become noise (fall out of parent).
            // Points from the large side continue in the parent cluster.

            if left_size >= min_cluster_size && !above_epsilon {
                // Both above min_cluster_size but below epsilon — keep in parent.
                stack.push((left, parent_cond));
                stack.push((right, parent_cond));
            } else if left_size < min_cluster_size && right_size < min_cluster_size {
                // Both sides too small — all points fall out.
                // Add all leaves to parent as fallen-out.
                collect_leaves(left, n_samples, &node_children, &mut |leaf| {
                    condensed_clusters[parent_cond].children.push(leaf);
                    condensed_clusters[parent_cond].child_lambdas.push(lambda);
                    point_cluster[leaf] = parent_cond;
                });
                collect_leaves(right, n_samples, &node_children, &mut |leaf| {
                    condensed_clusters[parent_cond].children.push(leaf);
                    condensed_clusters[parent_cond].child_lambdas.push(lambda);
                    point_cluster[leaf] = parent_cond;
                });
                condensed_clusters[parent_cond].death_lambda = lambda;
            } else {
                // One side is large, the other is small.
                let (large, small) = if left_size >= min_cluster_size {
                    (left, right)
                } else {
                    (right, left)
                };

                // Small side: points fall out of parent cluster.
                collect_leaves(small, n_samples, &node_children, &mut |leaf| {
                    condensed_clusters[parent_cond].children.push(leaf);
                    condensed_clusters[parent_cond].child_lambdas.push(lambda);
                    point_cluster[leaf] = parent_cond;
                });

                // Large side continues in parent.
                stack.push((large, parent_cond));
            }
        }
    }

    // Compute stability for each condensed cluster.
    // Stability = sum over points p in cluster: (lambda_p - birth_lambda)
    for cluster in &mut condensed_clusters {
        let birth = cluster.birth_lambda;
        let mut stab = 0.0;
        for &child_lambda in &cluster.child_lambdas {
            stab += child_lambda - birth;
        }
        cluster.stability = if stab > 0.0 { stab } else { 0.0 };
    }

    // Select clusters: bottom-up, each cluster is selected if its own stability
    // exceeds the sum of its children's stabilities.
    let n_cond = condensed_clusters.len();
    let mut selected = vec![true; n_cond];
    let mut total_stability = vec![0.0f64; n_cond];

    // Initialize: leaf condensed clusters have their own stability.
    for i in 0..n_cond {
        total_stability[i] = condensed_clusters[i].stability;
    }

    // Process bottom-up. Find parent-child relationships among condensed clusters.
    // A condensed cluster's children (that are clusters, not points) are those
    // that have child IDs >= n_samples in the children vector.
    for i in (0..n_cond).rev() {
        let mut child_sum = 0.0;
        let mut has_child_clusters = false;
        for &child_id in &condensed_clusters[i].children {
            if child_id >= n_samples {
                let cond_child = child_id - n_samples;
                if cond_child < n_cond {
                    child_sum += total_stability[cond_child];
                    has_child_clusters = true;
                }
            }
        }

        if has_child_clusters {
            if condensed_clusters[i].stability >= child_sum {
                // Select this cluster, deselect all descendants.
                total_stability[i] = condensed_clusters[i].stability;
                // Mark all descendant clusters as not selected.
                deselect_descendants(i, n_samples, &condensed_clusters, &mut selected);
            } else {
                // Children are better; don't select this cluster.
                selected[i] = false;
                total_stability[i] = child_sum;
            }
        }
        // Leaf condensed clusters (no child clusters) remain selected.
    }

    // Root cluster is never selected as a "real" cluster (it contains everything).
    if n_cond > 0 {
        selected[0] = false;
    }

    // Assign labels: map condensed cluster IDs to label IDs.
    let mut cluster_label_map: Vec<isize> = vec![-1; n_cond];
    let mut label_counter: isize = 0;
    for (i, &sel) in selected.iter().enumerate() {
        if sel && condensed_clusters[i].size >= min_cluster_size {
            cluster_label_map[i] = label_counter;
            label_counter += 1;
        }
    }

    // Assign each point a label based on which selected cluster it belongs to.
    let mut labels = Array1::from_elem(n_samples, -1isize);
    let mut probabilities = Array1::from_elem(n_samples, F::zero());

    for (pt, &cond_id) in point_cluster.iter().enumerate() {
        // Walk up from cond_id to find the nearest selected ancestor.
        let label =
            find_selected_cluster(cond_id, &cluster_label_map, &condensed_clusters, n_samples);
        labels[pt] = label;
    }

    // Compute probabilities based on the lambda at which each point falls out
    // vs. the cluster's birth and death lambdas.
    for cond_id in 0..n_cond {
        if cluster_label_map[cond_id] < 0 {
            continue;
        }
        let birth = condensed_clusters[cond_id].birth_lambda;
        let death = condensed_clusters[cond_id].death_lambda;
        let range = if death > birth { death - birth } else { 1.0 };

        for (idx, &child) in condensed_clusters[cond_id].children.iter().enumerate() {
            if child < n_samples {
                let child_lambda = condensed_clusters[cond_id].child_lambdas[idx];
                let prob = if range > 0.0 {
                    ((child_lambda - birth) / range).clamp(0.0, 1.0)
                } else {
                    1.0
                };
                if labels[child] == cluster_label_map[cond_id] {
                    probabilities[child] = F::from(prob).unwrap_or_else(F::zero);
                }
            }
        }
    }

    // Points that were assigned to a cluster but didn't get a probability
    // from the loop above (because they're in a descendant cluster that was
    // absorbed) get probability 1.0.
    for i in 0..n_samples {
        if labels[i] >= 0 && probabilities[i] == F::zero() {
            probabilities[i] = F::one();
        }
    }

    (labels, probabilities)
}

/// Collect all leaf (point) indices under a dendrogram node.
fn collect_leaves(
    node_id: usize,
    n_samples: usize,
    node_children: &[(usize, usize)],
    callback: &mut dyn FnMut(usize),
) {
    if node_id < n_samples {
        callback(node_id);
        return;
    }
    let (left, right) = node_children[node_id];
    collect_leaves(left, n_samples, node_children, callback);
    collect_leaves(right, n_samples, node_children, callback);
}

/// Mark all descendant condensed clusters as not selected.
fn deselect_descendants(
    cond_id: usize,
    n_samples: usize,
    condensed_clusters: &[CondensedNode],
    selected: &mut [bool],
) {
    for &child_id in &condensed_clusters[cond_id].children {
        if child_id >= n_samples {
            let child_cond = child_id - n_samples;
            if child_cond < condensed_clusters.len() {
                selected[child_cond] = false;
                deselect_descendants(child_cond, n_samples, condensed_clusters, selected);
            }
        }
    }
}

/// Walk up from a condensed cluster to find the nearest selected ancestor.
fn find_selected_cluster(
    cond_id: usize,
    cluster_label_map: &[isize],
    condensed_clusters: &[CondensedNode],
    n_samples: usize,
) -> isize {
    // Check if this cluster is selected.
    if cluster_label_map[cond_id] >= 0 {
        return cluster_label_map[cond_id];
    }

    // Check ancestors. We need to find which condensed cluster contains
    // cond_id as a child. Build a parent map.
    // Since this is called for each point, we do a simple search.
    for (i, cluster) in condensed_clusters.iter().enumerate() {
        for &child_id in &cluster.children {
            if child_id >= n_samples && child_id - n_samples == cond_id {
                return find_selected_cluster(i, cluster_label_map, condensed_clusters, n_samples);
            }
        }
    }

    -1 // Noise.
}

// ─────────────────────────────────────────────────────────────────────────────
// Trait impl: Fit
// ─────────────────────────────────────────────────────────────────────────────

impl<F: Float + Send + Sync + 'static> Fit<Array2<F>, ()> for Hdbscan<F> {
    type Fitted = FittedHdbscan<F>;
    type Error = FerroError;

    /// Fit the HDBSCAN model to the data.
    ///
    /// # Errors
    ///
    /// Returns [`FerroError::InvalidParameter`] if `min_cluster_size` is less
    /// than 2.
    fn fit(&self, x: &Array2<F>, _y: &()) -> Result<FittedHdbscan<F>, FerroError> {
        if self.min_cluster_size < 2 {
            return Err(FerroError::InvalidParameter {
                name: "min_cluster_size".into(),
                reason: "must be at least 2".into(),
            });
        }

        let min_samples = self.min_samples.unwrap_or(self.min_cluster_size);
        if min_samples == 0 {
            return Err(FerroError::InvalidParameter {
                name: "min_samples".into(),
                reason: "must be at least 1".into(),
            });
        }

        if self.cluster_selection_epsilon < F::zero() {
            return Err(FerroError::InvalidParameter {
                name: "cluster_selection_epsilon".into(),
                reason: "must be non-negative".into(),
            });
        }

        let n_samples = x.nrows();

        if n_samples == 0 {
            return Ok(FittedHdbscan {
                labels_: Array1::zeros(0),
                probabilities_: Array1::zeros(0),
            });
        }

        if n_samples < self.min_cluster_size {
            // Not enough points to form any cluster: all noise.
            return Ok(FittedHdbscan {
                labels_: Array1::from_elem(n_samples, -1isize),
                probabilities_: Array1::from_elem(n_samples, F::zero()),
            });
        }

        // Step 1: Compute core distances.
        let core_dists = compute_core_distances(x, min_samples);

        // Step 2-3: Build MST of mutual reachability graph.
        let mut mst_edges = build_mst(x, &core_dists);

        // Step 4-5: Build condensed tree and extract clusters.
        let (labels, probabilities) = extract_clusters(
            n_samples,
            &mut mst_edges,
            self.min_cluster_size,
            self.cluster_selection_epsilon,
        );

        Ok(FittedHdbscan {
            labels_: labels,
            probabilities_: probabilities,
        })
    }
}

impl<F: Float + Send + Sync + 'static> Hdbscan<F> {
    /// Fit on `x` and return the cluster labels for those samples in one
    /// call. Equivalent to sklearn `ClusterMixin.fit_predict`. Noise
    /// samples are labeled as `-1`.
    ///
    /// # Errors
    ///
    /// Forwards any error from [`Fit::fit`].
    pub fn fit_predict(&self, x: &Array2<F>) -> Result<Array1<isize>, FerroError> {
        let fitted = self.fit(x, &())?;
        Ok(fitted.labels().clone())
    }
}

#[cfg(test)]
mod tests {
    use super::*;
    use approx::assert_relative_eq;

    /// Two well-separated blobs.
    fn make_two_blobs() -> Array2<f64> {
        Array2::from_shape_vec(
            (12, 2),
            vec![
                // Cluster A near (0, 0)
                0.0, 0.0, 0.1, 0.0, 0.0, 0.1, 0.1, 0.1, 0.05, 0.05, -0.05, 0.05,
                // Cluster B near (10, 10)
                10.0, 10.0, 10.1, 10.0, 10.0, 10.1, 10.1, 10.1, 10.05, 10.05, 9.95, 10.05,
            ],
        )
        .unwrap()
    }

    /// Make a dense cluster dataset for HDBSCAN testing.
    fn make_dense_clusters() -> Array2<f64> {
        // Two dense clusters, well-separated.
        Array2::from_shape_vec(
            (20, 2),
            vec![
                // Cluster A: 10 points tightly packed near (0, 0)
                0.0, 0.0, 0.05, 0.0, 0.0, 0.05, 0.05, 0.05, -0.05, 0.0, 0.0, -0.05, -0.05, -0.05,
                0.03, 0.02, -0.02, 0.03, 0.04, -0.01,
                // Cluster B: 10 points tightly packed near (5, 5)
                5.0, 5.0, 5.05, 5.0, 5.0, 5.05, 5.05, 5.05, 4.95, 5.0, 5.0, 4.95, 4.95, 4.95, 5.03,
                5.02, 4.98, 5.03, 5.04, 4.99,
            ],
        )
        .unwrap()
    }

    #[test]
    fn test_two_clusters() {
        let x = make_two_blobs();
        let model = Hdbscan::<f64>::new().with_min_cluster_size(3);
        let fitted = model.fit(&x, &()).unwrap();

        let labels = fitted.labels();
        assert_eq!(labels.len(), 12);

        // First 6 should be in the same cluster.
        let first_label = labels[0];
        assert!(
            first_label >= 0,
            "expected cluster, got noise for first blob"
        );
        for i in 0..6 {
            assert_eq!(
                labels[i], first_label,
                "point {i} should be in the same cluster as point 0"
            );
        }

        // Last 6 should be in the same cluster.
        let second_label = labels[6];
        assert!(
            second_label >= 0,
            "expected cluster, got noise for second blob"
        );
        for i in 6..12 {
            assert_eq!(
                labels[i], second_label,
                "point {i} should be in same cluster as point 6"
            );
        }

        // The two clusters should have different labels.
        assert_ne!(first_label, second_label);
        assert_eq!(fitted.n_clusters(), 2);
    }

    #[test]
    fn test_noise_detection() {
        // Two tight blobs + outliers.
        let x = Array2::from_shape_vec(
            (14, 2),
            vec![
                // Cluster A
                0.0, 0.0, 0.1, 0.0, 0.0, 0.1, 0.1, 0.1, 0.05, 0.05, // Cluster B
                10.0, 10.0, 10.1, 10.0, 10.0, 10.1, 10.1, 10.1, 10.05, 10.05, // Outliers
                50.0, 50.0, -50.0, -50.0, 100.0, 0.0, 0.0, 100.0,
            ],
        )
        .unwrap();

        let model = Hdbscan::<f64>::new().with_min_cluster_size(3);
        let fitted = model.fit(&x, &()).unwrap();

        let labels = fitted.labels();
        // Outliers should be noise.
        assert_eq!(labels[10], -1, "outlier at (50,50) should be noise");
        assert_eq!(labels[11], -1, "outlier at (-50,-50) should be noise");
        assert_eq!(labels[12], -1, "outlier at (100,0) should be noise");
        assert_eq!(labels[13], -1, "outlier at (0,100) should be noise");

        // Cluster points should not be noise.
        assert!(labels[0] >= 0, "cluster A point should not be noise");
        assert!(labels[5] >= 0, "cluster B point should not be noise");
    }

    #[test]
    fn test_min_cluster_size_effect() {
        let x = make_two_blobs();

        // With small min_cluster_size, should find clusters.
        let model_small = Hdbscan::<f64>::new().with_min_cluster_size(2);
        let fitted_small = model_small.fit(&x, &()).unwrap();
        assert!(
            fitted_small.n_clusters() >= 1,
            "should find at least 1 cluster"
        );

        // With very large min_cluster_size, everything becomes noise.
        let model_large = Hdbscan::<f64>::new().with_min_cluster_size(100);
        let fitted_large = model_large.fit(&x, &()).unwrap();
        for &label in fitted_large.labels() {
            assert_eq!(
                label, -1,
                "all points should be noise with large min_cluster_size"
            );
        }
    }

    #[test]
    fn test_probabilities_range() {
        let x = make_two_blobs();
        let model = Hdbscan::<f64>::new().with_min_cluster_size(3);
        let fitted = model.fit(&x, &()).unwrap();

        for (i, &prob) in fitted.probabilities().iter().enumerate() {
            assert!(
                (0.0..=1.0).contains(&prob),
                "probability at index {i} is {prob}, expected [0, 1]"
            );
        }

        // Noise points should have probability 0.
        for i in 0..x.nrows() {
            if fitted.labels()[i] == -1 {
                assert_relative_eq!(fitted.probabilities()[i], 0.0);
            }
        }
    }

    #[test]
    fn test_dense_clusters() {
        let x = make_dense_clusters();
        let model = Hdbscan::<f64>::new()
            .with_min_cluster_size(3)
            .with_min_samples(3);
        let fitted = model.fit(&x, &()).unwrap();

        assert_eq!(fitted.labels().len(), 20);
        // Should find 2 clusters in the dense, well-separated data.
        assert_eq!(
            fitted.n_clusters(),
            2,
            "should find 2 clusters in well-separated dense data"
        );

        // Points 0-9 should be in one cluster, 10-19 in another.
        let first_label = fitted.labels()[0];
        assert!(first_label >= 0, "first cluster points should not be noise");
        for i in 0..10 {
            assert_eq!(
                fitted.labels()[i],
                first_label,
                "point {i} should be in cluster A"
            );
        }
        let second_label = fitted.labels()[10];
        assert!(
            second_label >= 0,
            "second cluster points should not be noise"
        );
        for i in 10..20 {
            assert_eq!(
                fitted.labels()[i],
                second_label,
                "point {i} should be in cluster B"
            );
        }
        assert_ne!(
            first_label, second_label,
            "two clusters should have different labels"
        );
    }

    #[test]
    fn test_empty_data() {
        let x = Array2::<f64>::zeros((0, 2));
        let model = Hdbscan::<f64>::new();
        let fitted = model.fit(&x, &()).unwrap();

        assert_eq!(fitted.labels().len(), 0);
        assert_eq!(fitted.probabilities().len(), 0);
        assert_eq!(fitted.n_clusters(), 0);
    }

    #[test]
    fn test_single_point() {
        let x = Array2::from_shape_vec((1, 2), vec![1.0, 1.0]).unwrap();
        let model = Hdbscan::<f64>::new().with_min_cluster_size(2);
        let fitted = model.fit(&x, &()).unwrap();

        assert_eq!(fitted.labels()[0], -1);
        assert_eq!(fitted.n_clusters(), 0);
    }

    #[test]
    fn test_too_few_for_cluster() {
        let x = Array2::from_shape_vec((3, 2), vec![0.0, 0.0, 0.1, 0.0, 0.0, 0.1]).unwrap();
        let model = Hdbscan::<f64>::new().with_min_cluster_size(5);
        let fitted = model.fit(&x, &()).unwrap();

        // With min_cluster_size=5 and only 3 points, all should be noise.
        for &label in fitted.labels() {
            assert_eq!(label, -1);
        }
    }

    #[test]
    fn test_invalid_min_cluster_size() {
        let x = Array2::from_shape_vec((3, 2), vec![1.0, 1.0, 2.0, 2.0, 3.0, 3.0]).unwrap();
        let model = Hdbscan::<f64>::new().with_min_cluster_size(1);
        let result = model.fit(&x, &());
        assert!(result.is_err());
    }

    #[test]
    fn test_cluster_selection_epsilon() {
        let x = make_two_blobs();

        // With very large epsilon, no splits should happen (everything is one blob or noise).
        let model = Hdbscan::<f64>::new()
            .with_min_cluster_size(3)
            .with_cluster_selection_epsilon(1000.0);
        let fitted = model.fit(&x, &()).unwrap();

        // Should either be all one cluster or all noise (depending on tree structure).
        let n = fitted.n_clusters();
        assert!(
            n <= 1,
            "with large epsilon, should have at most 1 cluster, got {n}"
        );
    }

    #[test]
    fn test_f32_support() {
        let x = Array2::from_shape_vec(
            (10, 2),
            vec![
                0.0f32, 0.0, 0.1, 0.0, 0.0, 0.1, 0.1, 0.1, 0.05, 0.05, 10.0, 10.0, 10.1, 10.0,
                10.0, 10.1, 10.1, 10.1, 10.05, 10.05,
            ],
        )
        .unwrap();

        let model = Hdbscan::<f32>::new().with_min_cluster_size(3);
        let fitted = model.fit(&x, &()).unwrap();
        assert_eq!(fitted.labels().len(), 10);
    }

    #[test]
    fn test_identical_points() {
        let x = Array2::from_shape_vec(
            (6, 2),
            vec![1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
        )
        .unwrap();

        let model = Hdbscan::<f64>::new().with_min_cluster_size(2);
        let fitted = model.fit(&x, &()).unwrap();

        // All identical points should be in the same cluster or all noise.
        let labels = fitted.labels();
        let first = labels[0];
        for &l in labels {
            assert_eq!(l, first, "identical points should have the same label");
        }
    }

    #[test]
    fn test_n_clusters_accessor() {
        let x = make_two_blobs();
        let model = Hdbscan::<f64>::new().with_min_cluster_size(3);
        let fitted = model.fit(&x, &()).unwrap();
        let n = fitted.n_clusters();
        assert!(n > 0, "should find clusters");
    }

    #[test]
    fn test_default_constructor() {
        let model = Hdbscan::<f64>::default();
        assert_eq!(model.min_cluster_size, 5);
        assert!(model.min_samples.is_none());
    }
}