leibniz 0.2.0 - Docs.rs

# Profiling

Profiling uses [`samply`][samply] and writes profiles to `target/profiles`.
Profile with the default iteration count, the same five iterations as
benchmarking, so profiles and benchmark log entries describe the same run
shape:

```shell
make profile BENCHMARK=evaluate
```

The profile target uses Samply's presymbolication sidecar option, so it writes
both `target/profiles/evaluate.json` and `target/profiles/evaluate.syms.json`.
Open a saved profile in the profiler UI with `samply load`:

```shell
samply load target/profiles/evaluate.json
```

The same `FEATURES`, `BENCHMARK`, and `ITERATIONS` variables accepted by
`benchmark` are accepted by `profile`.

```text
ndarray   evaluate-224-256-64  ..... 13.007686592s ± 31.650114ms
```

The current profile is after the discriminant fold and the signal footprint
hoist (completed items 20 and 21), on top of the batched windings, the
stacked quadratic roots, the reciprocal multiplication, the column-native
boundary geometry with gather lookups, and the hoisted sample-row reshape
(completed items 13 through 19). The run above recorded 64,985 samples over
five iterations under samply; the un-instrumented benchmark recorded 12.797s
after item 21. Three repeated runs under varying machine load left every
owner share within half a percentage point, so the tables below are robust to
the wall-time wobble visible across runs.

## Profile

The source-level winding path through boundary differentiation is:

```text
Loss::evaluate
  boundary::differentiate
    gradient::jumps
      winding::contains
        winding::evaluate
          winding::evaluate_contour
            winding::{linear,quadratic}_winding
              winding::{linear,quadratic}::evaluate
                roots::{linear,quadratic}::solve
```

The percentage tables below are flat sample-share lists. Rows are not children
of the row above them, and rows are not expected to sum to 100%.

The saved profile JSON is unsymbolicated, and the sibling `.syms.json` file
contains the preserved symbols for sampled executable addresses. Inline-only
helpers can disappear as separate sampled frames.

The main top-level local regions under `Loss::evaluate` are:

```text
boundary::differentiate                                89.01%
raster::render                                         10.73%
```

The winding frames own 73.53% of all samples, split into 63.66% through
`boundary::differentiate` and 9.87% through `raster::render`. Boundary
samples that do not pass through winding account for 25.35% of all samples.
Attributing each sample to its deepest local frame (in this build,
`linear::evaluate` is inlined into `evaluate_contour`):

```text
roots::quadratic::solve                                25.00%
winding::evaluate_contour (linear path and shared)     21.89%
winding::quadratic::evaluate                           20.92%
boundary::gradient::segment_gradients                   7.76%
boundary::distribution::Distribution::sample            5.85%
boundary::geometry::evaluate                            5.59%
roots::linear::solve                                    4.88%
boundary::signal::sample                                2.46%
boundary::signal::axis                                  1.20%
```

After the gather lookups in item 17 and the reciprocal multiplication in item
19, `select`, `slice`, and `div` are gone from the operation totals; the
remaining cost is almost entirely element-proportional arithmetic in winding
root solving and contribution evaluation.

Exclusive samples in local functions remain negligible; no local frame owns
more than 0.04% of samples directly. The local code mostly owns chains of
Burn tensor operations rather than doing scalar work itself.

The Burn operation costs below resolve almost entirely into ndarray's
allocate-and-sweep kernels rather than scalar arithmetic. Each whole tensor
operation allocates a fresh output buffer and iterates every sample once, so the
largest self-sample owners are the element-wise collectors and the allocator,
not the operation symbols:

```text
binary element-wise -> new array (collect_with_partial)  48.49%
in-place element-wise (Zip::for_each)                     15.61%
ArrayRef::map (unary)                                      6.00%
memmove/memcpy                                             5.29%
ternary and unary collect_with_partial                     3.54%
unary map -> Vec (to_vec_mapped)                           2.62%
malloc/free                                                1.97%
alloc constant (TensorData::full)                          1.29%
```

Allocation and full-array materialization together account for about 69% of
all samples, with another 15% in in-place sweeps; the `gather` and
`scatter_nd` kernels own a further 8% as indexed loops rather than sweeps.
The operation count is therefore still the cost driver: each operation is
roughly one allocation and one full pass over the samples, while the
arithmetic itself is nearly free. The operation tables that follow attribute
these same samples to the Burn operations that own them.

The visible Burn NdArray operation costs are:

```text
mul                                                    25.44%
sub                                                    12.29%
bool_and                                                9.77%
add                                                     9.18%
reshape                                                 6.32%
gather                                                  4.84%
bool_or                                                 4.32%
scatter_nd                                              3.52%
greater_equal                                           2.41%
greater                                                 2.04%
sum_dim                                                 1.73%
mask_where                                              1.59%
cat                                                     1.59%
greater_equal_elem                                      1.38%
lower_equal_elem                                        1.09%
```

For the hottest local owners, visible backend costs are:

```text
roots::quadratic::solve
  mul                                                   5.66%
  bool_and                                              4.35%
  bool_or                                               4.32%
  add                                                   3.77%
  sub                                                   1.90%
  reshape                                               1.47%
  mask_where                                            1.35%

winding::evaluate_contour
  sub                                                   6.09%
  mul                                                   5.80%
  bool_and                                              4.21%
  sum_dim                                               1.71%
  greater                                               0.94%

winding::quadratic::evaluate
  mul                                                   8.22%
  add                                                   3.83%
  sub                                                   3.69%
  greater                                               1.10%
  bool_and                                              0.99%
  reshape                                               0.93%

boundary::gradient::segment_gradients
  scatter_nd                                            3.52%
  reshape                                               1.71%
  cat                                                   1.17%

boundary::geometry::evaluate
  gather                                                2.63%
  reshape                                               0.94%

roots::linear::solve
  mul                                                   4.78%

boundary::distribution::Distribution::sample
  greater_equal                                         2.41%
  gather                                                0.65%

boundary::signal::sample
  gather                                                1.57%
```

Reading the path against the code, the remaining cost is the stacked winding
arithmetic itself. In `roots::quadratic::solve`, the `bool_or`, `bool_and`,
and `mask_where` rows are the stacked validity bookkeeping, and the `mul` and
`add` rows are the discriminant and root construction. In the contribution
passes, `mul`, `add`, and `sub` are the polynomial evaluation, offsets, and
crossing comparisons shared across the stacked roots. The linear solver has
collapsed to a single sample-sized multiplication, and the signal residue is
nearly pure footprint gathers. The `reshape` rows are the unsqueeze copies
whose removal the step 7 measurement showed to be offset by replacement
costs.

## Completed

1. Boundary signal sampling precomputes the weighted image before gathering,
   reducing repeated `float_select` work in `signal::sample`.

2. Boundary distribution sampling compares CDF values and samples through
   broadcasting instead of repeated grids, removing the main `repeat_dim` cost
   from `Distribution::sample`.

3. Root winding contributions now use boolean-to-float conversion and a folded
   crossing sign instead of two float mask writes.

4. Quadratic root solving now builds validity masks directly and avoids
   zero-filled helper tensors.

5. Segment-gradient accumulation broadcasts weights and normals instead of
   materializing repeated tensors before scattering.

6. Radius-0.5 signal sampling now scans only the two-by-two candidate footprint
   that can pass the filter predicate.

7. Raster and winding now pass coordinate columns directly, avoiding temporary
   sample matrices along the inside-outside path.

8. Boundary geometry now carries point and normal columns, and boundary signal
   sampling and jump evaluation consume those columns directly.

9. Boundary geometry now computes quadratic basis columns directly instead of
   building a basis matrix and slicing it back into columns.

10. Quadratic root solving now shares denominator work across the two quadratic
    roots, removing one sample-sized division.

11. Winding root solving now separates linear and quadratic validity so root
    contribution can derive crossing signs without evaluating the y-derivative
    at each root.

12. Winding evaluation now dispatches linear commands through a linear root
    solver while keeping quadratic commands on the quadratic path.

13. Boundary jump evaluation now classifies the minus and plus sample sets in
    one winding pass over the concatenated sample columns instead of two, which
    dropped the evaluate benchmark from 18.795s to 17.696s. This is step 1,
    option (a) of the plan; a profile A/B attributed the gain to fewer
    allocations and winding dispatch rather than less lane arithmetic.

14. Winding evaluation now batches across a contour's segments instead of
    looping per segment: it extracts each coordinate column once, groups
    segments by command, and broadcasts the per-segment coefficient columns
    against the sample row into a segment-by-sample tensor summed back over the
    segment axis. The evaluate benchmark dropped from 18.003s to 16.374s, and a
    profile A/B attributed the gain to collapsed per-segment accumulation, fewer
    allocations, and less coordinate slicing; raster classification batches the
    same way and improved too. This is step 2 of the plan. Chunking was measured
    to be inert at the benchmark's 1x1 sampling and left out for simplicity.

15. Linear root solving now negates the per-segment divisor column instead of
    the sample-sized constant tensor, removing one full-size pass with
    bit-identical results. The evaluate benchmark dropped from 16.255s to
    16.054s, about 1.2%, against a same-session baseline, in line with the
    0.98% negation `mul` the profile attributed under the solver. This is
    step 5 of the plan.

16. Quadratic winding now evaluates both roots in one stacked pass:
    `roots::quadratic::solve` emits `[2, segments, samples]` roots and
    validity, building the stack from broadcast columns (signed per-segment
    denominator columns carry the root signs, so the roots cost one division
    and one addition instead of a `cat`), and the contribution shares the
    curve evaluation, range checks, and crossing comparison across the roots,
    with the per-root sign reduced to per-segment columns. The evaluate
    benchmark dropped from 16.054s to 15.905s, about 0.9%, against a
    same-session baseline. A profile A/B confirmed the targeted classes:
    `mask_where` fell from 2.79% to 1.37% and `sub` from 18.86% to 9.62%,
    partly offset by the predicted bookkeeping in broadcast `or`, `div`,
    `add`, and `reshape`. This is step 3 of the plan.

17. Boundary geometry is now column-native, and sample-indexed
    one-dimensional lookups use `gather` instead of `select`. Control point
    coordinates are gathered as columns straight from the flattened
    arguments, and points, tangents, normals, and lengths are computed as
    columns, removing the basis and mask expansions, the `[samples, 2]`
    point gathers, and the column slice-backs, along with the
    `tensor::column`, `tensor::expand`, and `point::row_lengths` helpers. On
    its own the column rework was nearly inert, because the `select` that
    replaced the gathers is slow in the NdArray backend: it funnels the index
    tensor through a per-element iterator into a vector before the lookup. An
    isolated probe later measured select at 14.0 nanoseconds per element
    against 3.0 for gather, about a factor of 4.7, independent of index
    buffer sharing. Switching the coordinate
    lookups, the signal footprint gathers, the distribution CDF and PMF
    lookups, and the linear command mask to `gather` dropped the evaluate
    benchmark from 15.587s ± 8.9ms to 13.506s ± 84.3ms, about 13%, against a
    same-session baseline. The profile A/B shows `select` disappearing from
    the operation table entirely with `gather` at 4.56%, and the boundary
    owners halving: `geometry::evaluate` from 10.85% plus 3.54% in
    `left_normals` to 5.38%, `signal::sample` from 8.89% to 4.57%, and
    `Distribution::sample` from 7.07% to 5.48%. This addresses candidate 4.

18. Winding evaluation now reshapes the sample columns into rows once per
    classification, outside the contour loop, instead of once per contour.
    This is the offset-free subset of the step 7 rework: the backend reshape
    is a copy, so the per-contour reshape copied both sample columns once per
    contour. The evaluate benchmark dropped from 13.506s ± 84.3ms to
    13.414s ± 5.9ms, about 0.7%, with a confirming rerun at 13.408s ± 22.5ms.

19. Winding root solving now multiplies by reciprocal per-segment columns
    instead of dividing the sample-sized tensors: the linear solver multiplies
    the constant by the negated reciprocal divisor, and the quadratic solver
    shares one reciprocal denominator across the stacked roots. The product
    can deviate from the quotient in the last ulp, the first accepted
    departure from the reference arithmetic. The evaluate benchmark dropped
    from 13.414s ± 5.9ms to 12.936s ± 9.8ms, about 3.6%, against a
    same-session baseline, and `div` disappeared from the profile's operation
    totals (11.95% before), partly returning as cheaper `mul` (16.73% to
    25.30%).

20. The quadratic discriminant now scales the per-segment column instead of
    the sample-sized product, computing `b^2 - (a * 4) * c` and removing one
    sample-sized pass per quadratic winding call; the factor is a power of
    two, so the fold is exact. With machine noise elevated, a back-to-back
    paired comparison measured 13.117s ± 69.4ms before and 12.948s ± 30.7ms
    after, about 1%.

21. Boundary signal sampling now hoists the filter predicate, the bounds
    checks, and the integer indices of the box-filter footprint out of the
    cell loop: a whole row or column of cells shares its axis, so each is
    computed once per offset instead of once per cell, and one zero tensor
    is reused across the cells. The masks and indices are bit-identical to
    the per-cell versions. A paired comparison in a calm window measured
    12.947s ± 4.3ms before and 12.797s ± 7.3ms after, about 1.2%.

22. Segment-gradient scatters now use one-component indices into the
    gradients flattened to `[segments * 4]`, with the flat coordinate
    indices derived arithmetically from flat point index columns that
    boundary geometry carries instead of `[samples, 2]` index matrices. The
    scatter kernel itself is clean, but it pays an index read and a stride
    multiplication per index component per update, so the former
    three-component tuples cost three times the lookup work, and the zero
    and one index columns and the `[3 * samples, 3]` index concatenations
    that built them are gone as well. The accumulation order is unchanged,
    so the gradients are bit-identical. A paired comparison measured
    12.810s ± 22.3ms before and 12.473s ± 96.2ms after, about 2.6%.

## Candidates

These are the detailed observations the plan below draws on and prioritizes;
the plan is the sequenced, actionable version.

1. Reduce whole tensor operations in the quadratic winding contribution.

   `quadratic_winding` is the largest local owner at 29.26%, with
   `evaluate_polynomial_without_constant` owning another 6.67%. One call costs
   roughly forty `[segments, samples]` passes, and the two root contributions
   duplicate the polynomial evaluation, the range checks, and the curve-x
   comparison between them, about ten passes that stacking the roots would
   share. Because each pass is roughly one allocation and one full sweep over
   every sample (see the allocate-and-sweep costs under Profile), cutting the
   pass count cuts allocation and materialization one for one. The plan
   covered this as step 3, now completed item 16.

2. Treat root solving as part of the winding contribution path.

   `roots::quadratic::solve` is inlined into `quadratic_winding` and its cost
   appears there. `roots::linear::solve` is visible at 4.86%, almost entirely
   the division at 3.83% plus a 0.98% negation pass that step 5 removed
   (completed item 15). Avoiding root solving for samples or segments that
   cannot contribute stays deprioritized: samples span the whole image, so
   nearly every segment overlaps, and the eager backend cannot skip individual
   lanes cheaply. Keep it only as a later option if a sparser sampling regime
   or a non-eager backend changes that arithmetic.

3. Batch boundary jump classification across a layout's contours. Measured
   and rejected.

   `boundary::differentiate` calls `gradient::jumps` once per contour, and
   each call evaluates the winding of all the layout's contours against that
   contour's `2N` sample lanes, so per-operation fixed overhead scales with
   the square of the contour count. Concatenating the boundary sample columns
   of all contours and classifying once per layout pays coefficient
   construction and dispatch once instead of per contour, but the measured
   variant regressed the benchmark; see step 6 for the result and the cache
   explanation.

4. Inspect the boundary select and slice block after winding. Addressed by
   completed item 17.

   Boundary work that does not pass through winding was 37.67% of all
   samples, dominated by slice-backs, expansions, and the slow
   one-dimensional `select`. The column-native geometry and the gather swap
   removed most of it. What remains afterwards is `segment_gradients` (7.35%,
   scatter-bound), `Distribution::sample` (5.48%, mostly the segment-by-
   sample CDF comparison and argmax), `geometry::evaluate` (5.38%), and
   `signal::sample` (4.57%, four footprint gathers and masks).

5. Treat full boundary linear specialization as speculative.

   The clearest direct targets are the repeated linear mask in normal
   evaluation, one third of control-point gathers for linear samples, and one
   third of gradient scatter updates for linear samples. Splitting mixed
   sample records may add enough selection and concatenation work to erase
   much of that gain.

6. Keep segment-gradient scatter low priority unless it simplifies code.

   A single vector-valued scatter variant was tested and did not improve the
   benchmark. The current two-scatter layout leaves `scatter_nd` at 2.77%, so
   there is not enough signal to prioritize it on performance alone.

## Plan

The profile splits into two cost components. Element-proportional work scales
with `segments * samples * operations-per-segment` and shrinks only by cutting
operations per segment or the number of (segment, sample) pairs evaluated. Per
operation fixed overhead (`slice`, `squeeze_dim`, `malloc`, part of `from_iter`,
accumulation adds, and dispatch) scales with the operation count and shrinks by
issuing fewer, larger operations. The NdArray backend is eager and does not
fuse, and the fusing backends are far slower here, so the operation count has to
come down by hand rather than through the backend.

The hot dimensions and the structural multipliers that drive the operation count
are:

```text
samples per contour  N = height * width * x_samples * y_samples = 64*64*1*1 = 4096
per-segment coefficients are [segments, 1] columns broadcast over N
jumps classifies minus and plus coverage in one batched pass over 2N lanes
evaluate_contour batches a contour's segments into [segments, N] passes
quadratic::evaluate evaluates both roots in one stacked [2, segments, N] pass
differentiate calls jumps per contour, so winding repeats per contour pair
layouts hold one glyph in isolation, so the contour count is one to three
```

Batching does not undo the completed work that removed slicing by keeping
coordinates separated (items 7 to 9 above, and benchmark entries 92880de and
5324746). That work separates the coordinate axis: distinct quantities such as
x, y, normals, basis weights, and polynomial coefficients each stay a column
rather than being packed into a matrix that is sliced back on every use.
Batching packs the lane axis instead: homogeneous indices over which the same
work repeats, such as samples, the minus and plus pair, segments, and roots. The
two are orthogonal. A batched column is a longer or wider version of a separated
column, the columns stay separated inside the batch (concatenate x with x and y
with y, never x with y), and the matrices that appear are reduced by summation
rather than sliced back into coordinates. Steps 1 and 2 in fact lower the
coordinate slicing, because they stop rebuilding per-segment coefficients on
redundant passes; the only literal splits are the flat result splits in steps
1 and 6.

The steps are ordered by leverage over risk.

1. Done (option (a); completed item 13). Batching the minus and plus lane in
   `gradient::jumps` dropped the evaluate benchmark from 18.795s to 17.696s,
   about 5.6%. It concatenates each coordinate column along the sample axis (x
   with x, y with y), classifies once over the `2N` columns, then splits the
   flat result back and subtracts. The profile A/B confirmed the prediction:
   fewer allocations and winding dispatch, lane arithmetic untouched. Option
   (b), hoisting coefficients without concatenation, was not needed; (a)
   subsumes its saving.

2. Done (completed item 14). Batching a contour's segments in
   `evaluate_contour` dropped the evaluate benchmark from 18.003s to 16.374s,
   about 9%. It extracts each coordinate column once (the end column is a cyclic
   shift of the start column), groups segments by command to keep the linear
   two-point and quadratic three-point paths, and broadcasts the `[segments, 1]`
   coefficient columns against the `[1, samples]` row, then sums the resulting
   `[segments, samples]` tensor back over the segment axis. Those coordinate
   vectors are the separated-column shape completed items 8 and 9 use, and the
   matrix is summed away, not sliced back. The profile A/B confirmed the
   prediction: collapsed per-segment accumulation, fewer allocations, and less
   slicing, with the new `sum_dim` and grouping `select` small. Chunking was
   planned to bound the intermediates for cache residency, but a chunk-size
   sweep was inert at the benchmark's 1x1 sampling, so it was left out for
   simplicity; revisit only if a heavier-sampling benchmark shows the
   intermediates blowing cache.

3. Done (completed item 16). Sharing the two quadratic roots in one stacked
   `[2, segments, samples]` pass dropped the evaluate benchmark from 16.054s
   to 15.905s, about 0.9%, against a same-session baseline. The stack came
   from signed per-segment denominator columns rather than a `cat`, as
   planned, and the per-root sign collapsed to per-segment columns, removing
   the sample-sized sign `mask_where`. The predicted sign and validity
   bookkeeping offset part of the dispatch saving: the profile A/B shows
   `mask_where` and `sub` falling sharply while broadcast `or`, `div`, `add`,
   and `reshape` rise.

4. Done via step 2. Hoisting the coordinate-column extraction out of the
   per-segment loop is subsumed by the batched-segment pass, which extracts each
   coordinate column once per contour instead of slicing and squeezing per
   segment per coordinate.

5. Done (completed item 15). Folding the negation in `roots::linear::solve`
   into the divisor, dividing by the negated `safe_divisor(b)` column instead
   of negating the `[segments, samples]` constant, dropped the evaluate
   benchmark from 16.255s to 16.054s against a same-session baseline, about
   1.2%, matching the 0.98% `mul` the profile attributed to the negation. The
   results are bit-identical because IEEE division commutes the sign between
   numerator and denominator.

6. Measured and not kept. Batching boundary jump classification across a
   layout's contours (concatenating each boundary coordinate column across
   contours, x with x and y with y, and calling `gradient::jumps` once per
   layout with the flat result split back per contour) regressed the evaluate
   benchmark from 15.587s ± 8.9ms to 15.683s ± 2.4ms against a same-session
   baseline. The dispatch saving was real, with winding invocations falling
   from one per contour pair to one per contour, but every winding
   intermediate grew by the layout's contour count, from `[segments, 2N]` to
   `[segments, 2N * contours]` columns, and the benchmark's multi-glyph
   layouts carry tens of contours (the fixture averages three to four
   contours per glyph with several glyph instances per layout). The enlarged
   intermediates pushed the per-operation working set out of cache, and the
   element-proportional sweeps slowed by more than the fixed overhead saved;
   four `[contours * samples]` concatenations per layout were also added.
   This is the cache-blowup scenario the chunking note in step 2 anticipated.
   A cache-bounded variant that batches a few contours at a time remains
   possible if per-operation overhead grows dominant again.

7. Measured and not kept. The backend reshape is always a copy in Burn 0.21's
   NdArray backend: `reshape` goes through ndarray's `to_shape` followed by
   `into_shared`, and the borrowed view `to_shape` returns for a
   standard-layout array forces `into_shared` to clone the buffer, so every
   `reshape` and `unsqueeze` allocates and copies regardless of buffer
   uniqueness. Rebuilding the winding path so sample-sized tensors are born
   with the root axis through broadcasting (lifting only sample rows and
   per-segment columns) and hoisting the sample-row reshape out of the
   contour loop cut the reshape share from 5.38% to 3.57%, but the
   replacements cost the same: an extra segment-axis summation pass raised
   `sum_dim` from 1.41% to 2.41%, and the lifted rows still copy. The
   evaluate benchmark was inert at 15.581s ± 5.2ms against 15.587s ± 8.9ms,
   so the change was left out, following the chunking precedent in step 2.
   Its offset-free subset, hoisting the sample-row reshape out of the contour
   loop, was later kept as completed item 18.

8. Measured and not kept, twice over: the winding pass audit. Folding the
   crossing signs and the segment summation into a matrix product
   (`sign` transposed against the float validity, bit-identical because the
   contributions are exact zero and unit terms) regressed the evaluate
   benchmark from 12.956s ± 10.0ms to 13.018s ± 35.6ms in a paired
   comparison: a single-row product is a degenerate matmul that the kernel
   handles worse than the two streaming sweeps it replaces. Replacing the two
   parameter range comparisons with one clamp and one equality
   (`clamp(t, 0, 1) == t`) regressed it from 12.964s ± 5.6ms to
   14.411s ± 7.4ms, about 11%: the scalar comparisons are cheap bool-output
   passes, while the clamp materializes a float tensor and the equality reads
   two float operands. The audit's conclusion is that the winding pass
   structure is at its operation-count floor for this backend: scalar
   comparisons, boolean conjunctions, and the masked sum are each already in
   their cheapest available form, and algebraic folds that look like fewer
   operations cost more as tensors.

Deprioritize quadratic root algebra micro-optimizations (element-proportional
and already tight), segment-level geometric pruning (samples span the whole
image, so nearly every segment overlaps), and any reliance on backend fusion.
Contour-pair pruning in boundary winding is also deprioritized: layouts hold
one glyph in isolation, glyphs carry one to three mostly nested contours, and
the only zero-valued pair — a glyph's outer samples against a counter — is a
small term that needs a bounding-circle test to prove the separation that
bounding boxes cannot, while introducing data-dependent branching into a
uniform pipeline. Revisit only if compositions ever hold multiple glyphs.

Validate each step with `make benchmark BENCHMARK=evaluate` for timing and `make
profile` to confirm the targeted cost class actually dropped, then record the
result in BENCHMARKING.md and refresh this document.

## Direction

The Profile section above is current as of completed item 21; item 22 has
since flattened the segment-gradient scatter indices, and the step 8 audit
closed winding operation count as exhausted for this backend. The remaining
boundary residue is structurally sound, and the local optimization list is
exhausted pending the upstream fixes below.
Two backend findings have fixes in progress upstream, both verified on Burn's
main branch and pinned by an isolated probe: `select` converts the index
tensor through a per-element iterator before the lookup, measuring 14.0
nanoseconds per element against 3.0 for `gather`, addressed by
[pull request 5066][select], and `reshape` copies at memcpy bandwidth even
for uniquely owned standard-layout buffers (step 7), addressed by
[pull request 5067][reshape]. Once those land in a release, revisit the
gather workaround from item 17, since the fixed `select` measures faster than
`gather`, and the reshape economics behind steps 7 and 8 and item 18. For
boundary work, prefer small local cleanups over a full linear/quadratic
sample split unless a benchmark demonstrates that the split pays for its
selection overhead.

[reshape]: https://github.com/tracel-ai/burn/pull/5067
[samply]: https://github.com/mstange/samply
[select]: https://github.com/tracel-ai/burn/pull/5066