leibniz 0.1.0 - Docs.rs

# Profiling

Profiling uses [`samply`][samply] and writes profiles to `target/profiles`.
Run one iteration when profiling expensive benchmarks:

```shell
make profile BENCHMARK=evaluate ITERATIONS=1
```

The profile target uses Samply's presymbolication sidecar option, so it writes
both `target/profiles/evaluate.json` and `target/profiles/evaluate.syms.json`.
Open a saved profile in the profiler UI with `samply load`:

```shell
samply load target/profiles/evaluate.json
```

The same `FEATURES`, `BENCHMARK`, and `ITERATIONS` variables accepted by
`benchmark` are accepted by `profile`.

```text
ndarray   evaluate-224-256-64  . 19.612935959s
```

The current profile is after the command-level linear winding split, on top of
the direct quadratic basis columns, shared quadratic root denominator work,
column-native raster, boundary signal, and boundary jump changes.

## Profile

The source-level winding path through boundary differentiation is:

```text
Loss::evaluate
  boundary::differentiate
    gradient::jumps
      winding::contains
        winding::evaluate
          winding::evaluate_contour
            winding::{linear,quadratic}::evaluate
              roots::{linear,quadratic}::solve
```

The percentage tables below are flat sample-share lists. Rows are not children
of the row above them, and rows are not expected to sum to 100%.

The saved profile JSON is unsymbolicated, and the sibling `.syms.json` file
contains the preserved symbols for sampled executable addresses. Inline-only
helpers can disappear as separate sampled frames.

The main top-level local regions under `Loss::evaluate` are:

```text
boundary::differentiate                                90.73%
raster::render                                         8.96%
```

The main winding frames across both boundary differentiation and raster
rendering are:

```text
winding::evaluate                                     67.41%
winding::evaluate_contour                             67.16%
winding::linear::evaluate                             20.39%
winding::quadratic::evaluate_polynomial_without_constant 6.91%
roots::linear::solve                                   5.38%
winding::linear::coefficients                          3.55%
winding::quadratic::point_coefficients                 1.54%
roots::safe_divisor                                    1.04%
```

The linear path is the only per-command winding frame visible in this run; this
does not mean it outweighs the quadratic path, whose evaluate and solve work is
inlined into `winding::evaluate_contour` (see the contour row note below).

The visible winding samples split into 59.02% through
`boundary::differentiate` and 8.39% through `raster::render`.

Other local frames and operation groups under boundary differentiation are:

```text
select outside winding                                 8.01%
slice outside winding                                  7.22%
boundary::distribution::Distribution::sample           6.04%
point::left_normals                                    3.12%
scatter_nd                                             2.37%
repeat_dim                                             1.92%
gather_nd                                              1.10%
quadratic_basis                                        0.08%
```

The winding contour row uses `winding::evaluate_contour` as the representative
frame. In this run, `winding::linear::evaluate` and `roots::linear::solve` are
visible, but `winding::quadratic::evaluate`, `roots::quadratic::solve`, and
`boundary::signal::sample` are inlined or otherwise absent as separate sampled
symbols. Their work appears under `winding::evaluate_contour`,
`boundary::differentiate`, and the backend operation rows.

Exclusive samples in local functions remain tiny. The local code mostly owns
chains of Burn tensor operations rather than doing scalar work itself:

```text
winding::evaluate_contour                              0.30%
roots::safe_divisor                                    0.11%
winding::linear::evaluate                              0.08%
```

The Burn operation costs below resolve almost entirely into ndarray's
allocate-and-sweep kernels rather than scalar arithmetic. Each whole tensor
operation allocates a fresh output buffer and iterates every sample once, so the
largest self-sample owners are the element-wise collectors and the allocator,
not the operation symbols:

```text
binary element-wise -> new array (collect_with_partial)  30.61%
in-place element-wise (Zip::for_each)                     12.57%
Vec::from_iter (collect)                                   8.98%
unary map -> Vec (to_vec_mapped)                           7.76%
malloc/free                                                6.02%
memmove/memcpy                                             3.73%
ArrayRef::map (unary)                                      3.73%
alloc constant (TensorData::full)                          1.70%
```

Allocation and full-array materialization together account for about 63% of all
samples. The operation count is therefore the cost driver: each operation is
roughly one allocation and one full pass over the samples, while the arithmetic
itself is nearly free. The operation tables that follow attribute these same
samples to the Burn operations that own them.

The visible Burn NdArray operation costs are:

```text
sub                                                    13.91%
mul                                                    10.11%
bool_and                                                8.79%
select                                                  8.01%
slice                                                   7.59%
add                                                     7.31%
div                                                     4.74%
squeeze_dim                                             3.61%
reshape                                                 3.43%
mask_where                                              2.82%
cat                                                     2.42%
scatter_nd                                              2.36%
repeat_dim                                              2.17%
gather_nd                                               1.10%
```

For the hottest local owners, visible backend costs are:

```text
winding::evaluate_contour
  sub                                                  13.62%
  mul                                                  11.15%
  bool_and                                              8.58%
  add                                                   6.97%
  div                                                   4.66%
  greater                                               3.38%
  squeeze_dim                                           3.35%
  mask_where                                            2.72%
  reshape                                               1.76%
  bool_or                                               1.74%
  slice                                                 1.62%

winding::quadratic::evaluate_polynomial_without_constant
  mul                                                   3.64%
  add                                                   2.96%

point::left_normals
  slice                                                 3.03%
```

The command-level linear winding split moved the benchmark substantially, but
the profile is still dominated by winding inside boundary jump evaluation.
Boundary samples that do not pass through winding account for 31.70% of all
samples. The parts most directly affected by linear boundary specialization are
smaller: boundary `repeat_dim` is 1.92%, `gather_nd` is 1.10%, boundary
non-winding `mask_where` is 0.12%, and `quadratic_basis` is 0.08%.
Segment-gradient `scatter_nd` and related concatenation are larger, but a
linear two-point path can only remove the control-point portion of those
updates.

## Completed

1. Boundary signal sampling precomputes the weighted image before gathering,
   reducing repeated `float_select` work in `signal::sample`.

2. Boundary distribution sampling compares CDF values and samples through
   broadcasting instead of repeated grids, removing the main `repeat_dim` cost
   from `Distribution::sample`.

3. Root winding contributions now use boolean-to-float conversion and a folded
   crossing sign instead of two float mask writes.

4. Quadratic root solving now builds validity masks directly and avoids
   zero-filled helper tensors.

5. Segment-gradient accumulation broadcasts weights and normals instead of
   materializing repeated tensors before scattering.

6. Radius-0.5 signal sampling now scans only the two-by-two candidate footprint
   that can pass the filter predicate.

7. Raster and winding now pass coordinate columns directly, avoiding temporary
   sample matrices along the inside-outside path.

8. Boundary geometry now carries point and normal columns, and boundary signal
   sampling and jump evaluation consume those columns directly.

9. Boundary geometry now computes quadratic basis columns directly instead of
   building a basis matrix and slicing it back into columns.

10. Quadratic root solving now shares denominator work across the two quadratic
    roots, removing one sample-sized division.

11. Winding root solving now separates linear and quadratic validity so root
    contribution can derive crossing signs without evaluating the y-derivative
    at each root.

12. Winding evaluation now dispatches linear commands through a linear root
    solver while keeping quadratic commands on the quadratic path.

## Candidates

These are the detailed observations the plan below draws on and prioritizes;
the plan is the sequenced, actionable version.

1. Reduce whole tensor operations in winding segment contribution.

   `winding::evaluate_contour` is still the largest local owner. The former
   `evaluate_root_contribution` target is no longer visible as a sampled
   symbol, but the surrounding contour path still owns complete add, subtract,
   multiply, boolean, mask, and division passes. Future work should remove
   whole tensor operations across the segment contribution, for example by
   sharing root validity and curve evaluation across roots or combining root
   contribution work. Because each such pass is roughly one allocation and one
   full sweep over every sample (see the allocate-and-sweep costs under
   Profile), cutting the operation count cuts allocation and materialization one
   for one.

2. Revisit root solving as part of the winding contribution path.

   `roots::quadratic::solve` is not visible as a separate frame in this run,
   while `roots::linear::solve` is visible at 5.38%. Root-solving work should
   be treated as part of `winding::evaluate_contour` unless a later profile
   preserves the quadratic solver as its own frame. The chosen lever is sharing
   the two quadratic roots in one pass (see the plan, step 3). Avoiding root
   solving for samples or segments that cannot contribute is deprioritized:
   samples span the whole image, so nearly every segment overlaps, and the eager
   backend cannot skip individual lanes cheaply. Keep it only as a later option
   if a sparser sampling regime or a non-eager backend changes that arithmetic.

3. Reduce winding segment and coordinate slicing.

   Winding still pays visible `slice`, `squeeze_dim`, and shape-check costs
   around point coordinate extraction and coefficient construction. A useful
   change would carry per-coordinate segment data in the shape needed by
   winding instead of repeatedly slicing scalar coordinate tensors from the
   contour tensor. The plan covers this as step 4, subsumed by the
   batched-segment pass in step 2.

4. Treat full boundary linear specialization as speculative.

   Boundary non-winding work is visible, but the specific costs that a full
   linear/quadratic sample split would remove are modest. The clearest direct
   targets are the repeated linear mask in normal evaluation, one third of
   control-point gathers for linear samples, and one third of gradient scatter
   updates for linear samples. Splitting mixed sample records may add enough
   selection and concatenation work to erase much of that gain.

5. Inspect boundary select-heavy sampling later.

   `boundary::signal::sample` is not visible as a separate frame in this run,
   but boundary non-winding work still owns 8.01% in `select` and
   `Distribution::sample` accounts for 6.04%. Its ceiling is lower than
   winding/root work, so it should stay behind the winding candidates.

6. Keep segment-gradient scatter low priority unless it simplifies code.

   A single vector-valued scatter variant was tested and did not improve the
   benchmark. The current two-scatter layout leaves `scatter_nd` around 2.37%,
   so there is not enough signal to prioritize it on performance alone.

## Plan

The profile splits into two cost components. Element-proportional work scales
with `segments * samples * operations-per-segment` and shrinks only by cutting
operations per segment or the number of (segment, sample) pairs evaluated. Per
operation fixed overhead (`slice`, `squeeze_dim`, `malloc`, part of `from_iter`,
accumulation adds, and dispatch) scales with the operation count and shrinks by
issuing fewer, larger operations. The NdArray backend is eager and does not
fuse, and the fusing backends are far slower here, so the operation count has to
come down by hand rather than through the backend.

The hot dimensions and the structural multipliers that drive the operation count
are:

```text
samples per contour  N = height * width * x_samples * y_samples = 64*64*1*1 = 4096
per-segment coefficients are length-1 scalars broadcast over N
jumps evaluates the whole winding twice (minus and plus coverage)
evaluate_contour loops segments in Rust: one allocating N-pass plus one add each
quadratic::evaluate evaluates its two roots in separate passes
```

Batching does not undo the completed work that removed slicing by keeping
coordinates separated (items 7 to 9 above, and benchmark entries 92880de and
5324746). That work separates the coordinate axis: distinct quantities such as
x, y, normals, basis weights, and polynomial coefficients each stay a column
rather than being packed into a matrix that is sliced back on every use.
Batching packs the lane axis instead: homogeneous indices over which the same
work repeats, such as samples, the minus and plus pair, segments, and roots. The
two are orthogonal. A batched column is a longer or wider version of a separated
column, the columns stay separated inside the batch (concatenate x with x and y
with y, never x with y), and the matrices that appear are reduced by summation
rather than sliced back into coordinates. Steps 1 and 2 in fact lower the
coordinate slicing, because they stop rebuilding per-segment coefficients on
redundant passes; the only literal split is the single flat result split in
step 1.

The steps are ordered by leverage over risk.

1. Stop evaluating the minus and plus winding separately in `gradient::jumps`.
   Each `contains` call rebuilds every segment's coefficients, so the current
   two calls slice the contour arguments twice. Prefer (a) batching the minus
   and plus lane: extend each coordinate column along the sample axis (x with x,
   y with y), classify once over the `2N` columns, then split the flat result
   back with one slice before subtracting. Running winding once slices the
   coefficients once and halves the number of dispatched operations, since each
   runs over `2N` rather than twice over `N`. That cuts fixed per-operation
   overhead (allocation, iterator setup, coefficient slicing, Rust loop and API
   dispatch), not the lane arithmetic: the pass still sweeps `2N` elements, the
   same total as two `N` passes. The added cost is two sample-axis
   concatenations and one split, O(1) per evaluation against the per-operation
   overhead saved across every segment, so at `N = 4096` it should win. Option
   (a) subsumes the saving of (b) hoisting the coefficients (compute once,
   reuse across two element passes, no concatenation or split),
   so treat (b) as the escape hatch only if the benchmark shows (a)'s
   concatenations cost more than the dispatch they save. No winding API change;
   do first as the down payment before the larger refactor.

2. Batch segments within a contour, in fixed-size chunks. Replace the
   per-segment Rust loop in `evaluate_contour` with a chunked batched
   evaluation: group a contour's segments by command, and for each chunk of `C`
   segments (start with `C` around 8 to 32) extract each coordinate column as a
   `[C]` vector once, broadcast against the `[samples]` column into
   `[C, samples]`, run root solving and contribution once per chunk, sum over
   the segment axis, and accumulate into the running winding total. Chunk from
   the start rather than batching a whole contour at once: an unbounded
   `[segments, samples]` widens every intermediate and can fall out of L2/L3,
   which regresses the eager CPU backend even though the allocation count drops.
   A chunk caps each intermediate at `C * samples`, keeping it cache resident.
   The per-operation fixed cost amortizes as `fixed / (C * samples)`, so even a
   small `C` captures most of the dispatch win, favoring the low end of the
   range; the chunk size is a tunable to benchmark. Unlike step 1's minus and
   plus lane, which is a bounded factor of two, the segment lane is variable and
   large, so it is the one that must be tiled. This collapses the sequential
   per-segment passes, removes the per-segment `slice` and `squeeze_dim`, cuts
   the allocation count, and replaces the segment adds with one reduction per
   chunk. The `[C]` coordinate vectors are the same separated-column shape that
   completed items 8 and 9 already use for boundary points, normals, and basis
   weights, and the `[C, samples]` tensor is summed away, not sliced back,
   so this extends the no-slicing pattern to the segment axis. It composes with
   step 1 into a single `[C, 2N]` pass. This is the largest structural win but
   reshapes the winding API; land it behind the benchmark in one focused change.

3. Share the two quadratic roots in `winding::quadratic`. Evaluate both roots in
   one `[2, samples]` pass instead of two `[samples]` passes, ideally by having
   `roots::quadratic::solve` emit a `[2, samples]` root layout so the share does
   not pay for a `cat`. Halves the contribution operation count on the quadratic
   path. If reached through `cat`, the added memmove can erase the gain on the
   eager backend, so pursue it only with stacked roots and confirm with the
   benchmark. Even stacked, the two roots do not share sign or validity logic:
   the first carries the `mask_where` between the linear and quadratic signs
   while the second is unconditionally positive, so unifying them adds
   bookkeeping that can offset the dispatch saved. This is a further reason to
   keep it last and benchmark-gated. It makes the root-sharing half of candidate
   1 concrete.

4. If step 2 is deferred, hoist the coordinate-column extraction out of the
   per-segment loop on its own: extract each coordinate column once per contour
   and index per segment instead of slicing and squeezing per segment per
   coordinate. Step 2 subsumes this.

Deprioritize quadratic root algebra micro-optimizations (element-proportional
and already tight), segment-level geometric pruning (samples span the whole
image, so nearly every segment overlaps), and any reliance on backend fusion.

Validate each step with `make benchmark BENCHMARK=evaluate` for timing and `make
profile` to confirm the targeted cost class actually dropped, then record the
result in BENCHMARKING.md and refresh this document.

## Direction

Follow the steps in the plan above, starting with the batched minus and plus
winding in `gradient::jumps` and then the batched-segment winding pass. Treat
the quadratic root sharing as a measured follow-up. For boundary work, prefer
small local cleanups over a full linear/quadratic sample split unless a
benchmark demonstrates that the split pays for its selection overhead.

[samply]: https://github.com/mstange/samply