# Profiling
Profiling uses [`samply`][samply] and writes profiles to `target/profiles`.
Profile with the default iteration count, the same five iterations as
benchmarking, so profiles and benchmark log entries describe the same run
shape:
```shell
make profile BENCHMARK=evaluate
```
The profile target uses Samply's presymbolication sidecar option, so it writes
both `target/profiles/evaluate.json` and `target/profiles/evaluate.syms.json`.
Open a saved profile in the profiler UI with `samply load`:
```shell
samply load target/profiles/evaluate.json
```
The same `FEATURES`, `BENCHMARK`, and `ITERATIONS` variables accepted by
`benchmark` are accepted by `profile`.
```text
ndarray evaluate-224-256-64 ..... 13.007686592s ± 31.650114ms
```
The current profile is after the discriminant fold and the signal footprint
hoist (completed items 20 and 21), on top of the batched windings, the
stacked quadratic roots, the reciprocal multiplication, the column-native
boundary geometry with gather lookups, and the hoisted sample-row reshape
(completed items 13 through 19). The run above recorded 64,985 samples over
five iterations under samply; the un-instrumented benchmark recorded 12.797s
after item 21. Three repeated runs under varying machine load left every
owner share within half a percentage point, so the tables below are robust to
the wall-time wobble visible across runs.
## Profile
The source-level winding path through boundary differentiation is:
```text
Loss::evaluate
boundary::differentiate
gradient::jumps
winding::contains
winding::evaluate
winding::evaluate_contour
winding::{linear,quadratic}_winding
winding::{linear,quadratic}::evaluate
roots::{linear,quadratic}::solve
```
The percentage tables below are flat sample-share lists. Rows are not children
of the row above them, and rows are not expected to sum to 100%.
The saved profile JSON is unsymbolicated, and the sibling `.syms.json` file
contains the preserved symbols for sampled executable addresses. Inline-only
helpers can disappear as separate sampled frames.
The main top-level local regions under `Loss::evaluate` are:
```text
boundary::differentiate 89.01%
raster::render 10.73%
```
The winding frames own 73.53% of all samples, split into 63.66% through
`boundary::differentiate` and 9.87% through `raster::render`. Boundary
samples that do not pass through winding account for 25.35% of all samples.
Attributing each sample to its deepest local frame (in this build,
`linear::evaluate` is inlined into `evaluate_contour`):
```text
roots::quadratic::solve 25.00%
winding::evaluate_contour (linear path and shared) 21.89%
winding::quadratic::evaluate 20.92%
boundary::gradient::segment_gradients 7.76%
boundary::distribution::Distribution::sample 5.85%
boundary::geometry::evaluate 5.59%
roots::linear::solve 4.88%
boundary::signal::sample 2.46%
boundary::signal::axis 1.20%
```
After the gather lookups in item 17 and the reciprocal multiplication in item
19, `select`, `slice`, and `div` are gone from the operation totals; the
remaining cost is almost entirely element-proportional arithmetic in winding
root solving and contribution evaluation.
Exclusive samples in local functions remain negligible; no local frame owns
more than 0.04% of samples directly. The local code mostly owns chains of
Burn tensor operations rather than doing scalar work itself.
The Burn operation costs below resolve almost entirely into ndarray's
allocate-and-sweep kernels rather than scalar arithmetic. Each whole tensor
operation allocates a fresh output buffer and iterates every sample once, so the
largest self-sample owners are the element-wise collectors and the allocator,
not the operation symbols:
```text
binary element-wise -> new array (collect_with_partial) 48.49%
in-place element-wise (Zip::for_each) 15.61%
ArrayRef::map (unary) 6.00%
memmove/memcpy 5.29%
ternary and unary collect_with_partial 3.54%
unary map -> Vec (to_vec_mapped) 2.62%
malloc/free 1.97%
alloc constant (TensorData::full) 1.29%
```
Allocation and full-array materialization together account for about 69% of
all samples, with another 15% in in-place sweeps; the `gather` and
`scatter_nd` kernels own a further 8% as indexed loops rather than sweeps.
The operation count is therefore still the cost driver: each operation is
roughly one allocation and one full pass over the samples, while the
arithmetic itself is nearly free. The operation tables that follow attribute
these same samples to the Burn operations that own them.
The visible Burn NdArray operation costs are:
```text
mul 25.44%
sub 12.29%
bool_and 9.77%
add 9.18%
reshape 6.32%
gather 4.84%
bool_or 4.32%
scatter_nd 3.52%
greater_equal 2.41%
greater 2.04%
sum_dim 1.73%
mask_where 1.59%
cat 1.59%
greater_equal_elem 1.38%
lower_equal_elem 1.09%
```
For the hottest local owners, visible backend costs are:
```text
roots::quadratic::solve
mul 5.66%
bool_and 4.35%
bool_or 4.32%
add 3.77%
sub 1.90%
reshape 1.47%
mask_where 1.35%
winding::evaluate_contour
sub 6.09%
mul 5.80%
bool_and 4.21%
sum_dim 1.71%
greater 0.94%
winding::quadratic::evaluate
mul 8.22%
add 3.83%
sub 3.69%
greater 1.10%
bool_and 0.99%
reshape 0.93%
boundary::gradient::segment_gradients
scatter_nd 3.52%
reshape 1.71%
cat 1.17%
boundary::geometry::evaluate
gather 2.63%
reshape 0.94%
roots::linear::solve
mul 4.78%
boundary::distribution::Distribution::sample
greater_equal 2.41%
gather 0.65%
boundary::signal::sample
gather 1.57%
```
Reading the path against the code, the remaining cost is the stacked winding
arithmetic itself. In `roots::quadratic::solve`, the `bool_or`, `bool_and`,
and `mask_where` rows are the stacked validity bookkeeping, and the `mul` and
`add` rows are the discriminant and root construction. In the contribution
passes, `mul`, `add`, and `sub` are the polynomial evaluation, offsets, and
crossing comparisons shared across the stacked roots. The linear solver has
collapsed to a single sample-sized multiplication, and the signal residue is
nearly pure footprint gathers. The `reshape` rows are the unsqueeze copies
whose removal the step 7 measurement showed to be offset by replacement
costs.
## Completed
1. Boundary signal sampling precomputes the weighted image before gathering,
reducing repeated `float_select` work in `signal::sample`.
2. Boundary distribution sampling compares CDF values and samples through
broadcasting instead of repeated grids, removing the main `repeat_dim` cost
from `Distribution::sample`.
3. Root winding contributions now use boolean-to-float conversion and a folded
crossing sign instead of two float mask writes.
4. Quadratic root solving now builds validity masks directly and avoids
zero-filled helper tensors.
5. Segment-gradient accumulation broadcasts weights and normals instead of
materializing repeated tensors before scattering.
6. Radius-0.5 signal sampling now scans only the two-by-two candidate footprint
that can pass the filter predicate.
7. Raster and winding now pass coordinate columns directly, avoiding temporary
sample matrices along the inside-outside path.
8. Boundary geometry now carries point and normal columns, and boundary signal
sampling and jump evaluation consume those columns directly.
9. Boundary geometry now computes quadratic basis columns directly instead of
building a basis matrix and slicing it back into columns.
10. Quadratic root solving now shares denominator work across the two quadratic
roots, removing one sample-sized division.
11. Winding root solving now separates linear and quadratic validity so root
contribution can derive crossing signs without evaluating the y-derivative
at each root.
12. Winding evaluation now dispatches linear commands through a linear root
solver while keeping quadratic commands on the quadratic path.
13. Boundary jump evaluation now classifies the minus and plus sample sets in
one winding pass over the concatenated sample columns instead of two, which
dropped the evaluate benchmark from 18.795s to 17.696s. This is step 1,
option (a) of the plan; a profile A/B attributed the gain to fewer
allocations and winding dispatch rather than less lane arithmetic.
14. Winding evaluation now batches across a contour's segments instead of
looping per segment: it extracts each coordinate column once, groups
segments by command, and broadcasts the per-segment coefficient columns
against the sample row into a segment-by-sample tensor summed back over the
segment axis. The evaluate benchmark dropped from 18.003s to 16.374s, and a
profile A/B attributed the gain to collapsed per-segment accumulation, fewer
allocations, and less coordinate slicing; raster classification batches the
same way and improved too. This is step 2 of the plan. Chunking was measured
to be inert at the benchmark's 1x1 sampling and left out for simplicity.
15. Linear root solving now negates the per-segment divisor column instead of
the sample-sized constant tensor, removing one full-size pass with
bit-identical results. The evaluate benchmark dropped from 16.255s to
16.054s, about 1.2%, against a same-session baseline, in line with the
0.98% negation `mul` the profile attributed under the solver. This is
step 5 of the plan.
16. Quadratic winding now evaluates both roots in one stacked pass:
`roots::quadratic::solve` emits `[2, segments, samples]` roots and
validity, building the stack from broadcast columns (signed per-segment
denominator columns carry the root signs, so the roots cost one division
and one addition instead of a `cat`), and the contribution shares the
curve evaluation, range checks, and crossing comparison across the roots,
with the per-root sign reduced to per-segment columns. The evaluate
benchmark dropped from 16.054s to 15.905s, about 0.9%, against a
same-session baseline. A profile A/B confirmed the targeted classes:
`mask_where` fell from 2.79% to 1.37% and `sub` from 18.86% to 9.62%,
partly offset by the predicted bookkeeping in broadcast `or`, `div`,
`add`, and `reshape`. This is step 3 of the plan.
17. Boundary geometry is now column-native, and sample-indexed
one-dimensional lookups use `gather` instead of `select`. Control point
coordinates are gathered as columns straight from the flattened
arguments, and points, tangents, normals, and lengths are computed as
columns, removing the basis and mask expansions, the `[samples, 2]`
point gathers, and the column slice-backs, along with the
`tensor::column`, `tensor::expand`, and `point::row_lengths` helpers. On
its own the column rework was nearly inert, because the `select` that
replaced the gathers is slow in the NdArray backend: it funnels the index
tensor through a per-element iterator into a vector before the lookup. An
isolated probe later measured select at 14.0 nanoseconds per element
against 3.0 for gather, about a factor of 4.7, independent of index
buffer sharing. Switching the coordinate
lookups, the signal footprint gathers, the distribution CDF and PMF
lookups, and the linear command mask to `gather` dropped the evaluate
benchmark from 15.587s ± 8.9ms to 13.506s ± 84.3ms, about 13%, against a
same-session baseline. The profile A/B shows `select` disappearing from
the operation table entirely with `gather` at 4.56%, and the boundary
owners halving: `geometry::evaluate` from 10.85% plus 3.54% in
`left_normals` to 5.38%, `signal::sample` from 8.89% to 4.57%, and
`Distribution::sample` from 7.07% to 5.48%. This addresses candidate 4.
18. Winding evaluation now reshapes the sample columns into rows once per
classification, outside the contour loop, instead of once per contour.
This is the offset-free subset of the step 7 rework: the backend reshape
is a copy, so the per-contour reshape copied both sample columns once per
contour. The evaluate benchmark dropped from 13.506s ± 84.3ms to
13.414s ± 5.9ms, about 0.7%, with a confirming rerun at 13.408s ± 22.5ms.
19. Winding root solving now multiplies by reciprocal per-segment columns
instead of dividing the sample-sized tensors: the linear solver multiplies
the constant by the negated reciprocal divisor, and the quadratic solver
shares one reciprocal denominator across the stacked roots. The product
can deviate from the quotient in the last ulp, the first accepted
departure from the reference arithmetic. The evaluate benchmark dropped
from 13.414s ± 5.9ms to 12.936s ± 9.8ms, about 3.6%, against a
same-session baseline, and `div` disappeared from the profile's operation
totals (11.95% before), partly returning as cheaper `mul` (16.73% to
25.30%).
20. The quadratic discriminant now scales the per-segment column instead of
the sample-sized product, computing `b^2 - (a * 4) * c` and removing one
sample-sized pass per quadratic winding call; the factor is a power of
two, so the fold is exact. With machine noise elevated, a back-to-back
paired comparison measured 13.117s ± 69.4ms before and 12.948s ± 30.7ms
after, about 1%.
21. Boundary signal sampling now hoists the filter predicate, the bounds
checks, and the integer indices of the box-filter footprint out of the
cell loop: a whole row or column of cells shares its axis, so each is
computed once per offset instead of once per cell, and one zero tensor
is reused across the cells. The masks and indices are bit-identical to
the per-cell versions. A paired comparison in a calm window measured
12.947s ± 4.3ms before and 12.797s ± 7.3ms after, about 1.2%.
22. Segment-gradient scatters now use one-component indices into the
gradients flattened to `[segments * 4]`, with the flat coordinate
indices derived arithmetically from flat point index columns that
boundary geometry carries instead of `[samples, 2]` index matrices. The
scatter kernel itself is clean, but it pays an index read and a stride
multiplication per index component per update, so the former
three-component tuples cost three times the lookup work, and the zero
and one index columns and the `[3 * samples, 3]` index concatenations
that built them are gone as well. The accumulation order is unchanged,
so the gradients are bit-identical. A paired comparison measured
12.810s ± 22.3ms before and 12.473s ± 96.2ms after, about 2.6%.
## Candidates
These are the detailed observations the plan below draws on and prioritizes;
the plan is the sequenced, actionable version.
1. Reduce whole tensor operations in the quadratic winding contribution.
`quadratic_winding` is the largest local owner at 29.26%, with
`evaluate_polynomial_without_constant` owning another 6.67%. One call costs
roughly forty `[segments, samples]` passes, and the two root contributions
duplicate the polynomial evaluation, the range checks, and the curve-x
comparison between them, about ten passes that stacking the roots would
share. Because each pass is roughly one allocation and one full sweep over
every sample (see the allocate-and-sweep costs under Profile), cutting the
pass count cuts allocation and materialization one for one. The plan
covered this as step 3, now completed item 16.
2. Treat root solving as part of the winding contribution path.
`roots::quadratic::solve` is inlined into `quadratic_winding` and its cost
appears there. `roots::linear::solve` is visible at 4.86%, almost entirely
the division at 3.83% plus a 0.98% negation pass that step 5 removed
(completed item 15). Avoiding root solving for samples or segments that
cannot contribute stays deprioritized: samples span the whole image, so
nearly every segment overlaps, and the eager backend cannot skip individual
lanes cheaply. Keep it only as a later option if a sparser sampling regime
or a non-eager backend changes that arithmetic.
3. Batch boundary jump classification across a layout's contours. Measured
and rejected.
`boundary::differentiate` calls `gradient::jumps` once per contour, and
each call evaluates the winding of all the layout's contours against that
contour's `2N` sample lanes, so per-operation fixed overhead scales with
the square of the contour count. Concatenating the boundary sample columns
of all contours and classifying once per layout pays coefficient
construction and dispatch once instead of per contour, but the measured
variant regressed the benchmark; see step 6 for the result and the cache
explanation.
4. Inspect the boundary select and slice block after winding. Addressed by
completed item 17.
Boundary work that does not pass through winding was 37.67% of all
samples, dominated by slice-backs, expansions, and the slow
one-dimensional `select`. The column-native geometry and the gather swap
removed most of it. What remains afterwards is `segment_gradients` (7.35%,
scatter-bound), `Distribution::sample` (5.48%, mostly the segment-by-
sample CDF comparison and argmax), `geometry::evaluate` (5.38%), and
`signal::sample` (4.57%, four footprint gathers and masks).
5. Treat full boundary linear specialization as speculative.
The clearest direct targets are the repeated linear mask in normal
evaluation, one third of control-point gathers for linear samples, and one
third of gradient scatter updates for linear samples. Splitting mixed
sample records may add enough selection and concatenation work to erase
much of that gain.
6. Keep segment-gradient scatter low priority unless it simplifies code.
A single vector-valued scatter variant was tested and did not improve the
benchmark. The current two-scatter layout leaves `scatter_nd` at 2.77%, so
there is not enough signal to prioritize it on performance alone.
## Plan
The profile splits into two cost components. Element-proportional work scales
with `segments * samples * operations-per-segment` and shrinks only by cutting
operations per segment or the number of (segment, sample) pairs evaluated. Per
operation fixed overhead (`slice`, `squeeze_dim`, `malloc`, part of `from_iter`,
accumulation adds, and dispatch) scales with the operation count and shrinks by
issuing fewer, larger operations. The NdArray backend is eager and does not
fuse, and the fusing backends are far slower here, so the operation count has to
come down by hand rather than through the backend.
The hot dimensions and the structural multipliers that drive the operation count
are:
```text
samples per contour N = height * width * x_samples * y_samples = 64*64*1*1 = 4096
per-segment coefficients are [segments, 1] columns broadcast over N
jumps classifies minus and plus coverage in one batched pass over 2N lanes
evaluate_contour batches a contour's segments into [segments, N] passes
quadratic::evaluate evaluates both roots in one stacked [2, segments, N] pass
differentiate calls jumps per contour, so winding repeats per contour pair
layouts hold one glyph in isolation, so the contour count is one to three
```
Batching does not undo the completed work that removed slicing by keeping
coordinates separated (items 7 to 9 above, and benchmark entries 92880de and
5324746). That work separates the coordinate axis: distinct quantities such as
x, y, normals, basis weights, and polynomial coefficients each stay a column
rather than being packed into a matrix that is sliced back on every use.
Batching packs the lane axis instead: homogeneous indices over which the same
work repeats, such as samples, the minus and plus pair, segments, and roots. The
two are orthogonal. A batched column is a longer or wider version of a separated
column, the columns stay separated inside the batch (concatenate x with x and y
with y, never x with y), and the matrices that appear are reduced by summation
rather than sliced back into coordinates. Steps 1 and 2 in fact lower the
coordinate slicing, because they stop rebuilding per-segment coefficients on
redundant passes; the only literal splits are the flat result splits in steps
1 and 6.
The steps are ordered by leverage over risk.
1. Done (option (a); completed item 13). Batching the minus and plus lane in
`gradient::jumps` dropped the evaluate benchmark from 18.795s to 17.696s,
about 5.6%. It concatenates each coordinate column along the sample axis (x
with x, y with y), classifies once over the `2N` columns, then splits the
flat result back and subtracts. The profile A/B confirmed the prediction:
fewer allocations and winding dispatch, lane arithmetic untouched. Option
(b), hoisting coefficients without concatenation, was not needed; (a)
subsumes its saving.
2. Done (completed item 14). Batching a contour's segments in
`evaluate_contour` dropped the evaluate benchmark from 18.003s to 16.374s,
about 9%. It extracts each coordinate column once (the end column is a cyclic
shift of the start column), groups segments by command to keep the linear
two-point and quadratic three-point paths, and broadcasts the `[segments, 1]`
coefficient columns against the `[1, samples]` row, then sums the resulting
`[segments, samples]` tensor back over the segment axis. Those coordinate
vectors are the separated-column shape completed items 8 and 9 use, and the
matrix is summed away, not sliced back. The profile A/B confirmed the
prediction: collapsed per-segment accumulation, fewer allocations, and less
slicing, with the new `sum_dim` and grouping `select` small. Chunking was
planned to bound the intermediates for cache residency, but a chunk-size
sweep was inert at the benchmark's 1x1 sampling, so it was left out for
simplicity; revisit only if a heavier-sampling benchmark shows the
intermediates blowing cache.
3. Done (completed item 16). Sharing the two quadratic roots in one stacked
`[2, segments, samples]` pass dropped the evaluate benchmark from 16.054s
to 15.905s, about 0.9%, against a same-session baseline. The stack came
from signed per-segment denominator columns rather than a `cat`, as
planned, and the per-root sign collapsed to per-segment columns, removing
the sample-sized sign `mask_where`. The predicted sign and validity
bookkeeping offset part of the dispatch saving: the profile A/B shows
`mask_where` and `sub` falling sharply while broadcast `or`, `div`, `add`,
and `reshape` rise.
4. Done via step 2. Hoisting the coordinate-column extraction out of the
per-segment loop is subsumed by the batched-segment pass, which extracts each
coordinate column once per contour instead of slicing and squeezing per
segment per coordinate.
5. Done (completed item 15). Folding the negation in `roots::linear::solve`
into the divisor, dividing by the negated `safe_divisor(b)` column instead
of negating the `[segments, samples]` constant, dropped the evaluate
benchmark from 16.255s to 16.054s against a same-session baseline, about
1.2%, matching the 0.98% `mul` the profile attributed to the negation. The
results are bit-identical because IEEE division commutes the sign between
numerator and denominator.
6. Measured and not kept. Batching boundary jump classification across a
layout's contours (concatenating each boundary coordinate column across
contours, x with x and y with y, and calling `gradient::jumps` once per
layout with the flat result split back per contour) regressed the evaluate
benchmark from 15.587s ± 8.9ms to 15.683s ± 2.4ms against a same-session
baseline. The dispatch saving was real, with winding invocations falling
from one per contour pair to one per contour, but every winding
intermediate grew by the layout's contour count, from `[segments, 2N]` to
`[segments, 2N * contours]` columns, and the benchmark's multi-glyph
layouts carry tens of contours (the fixture averages three to four
contours per glyph with several glyph instances per layout). The enlarged
intermediates pushed the per-operation working set out of cache, and the
element-proportional sweeps slowed by more than the fixed overhead saved;
four `[contours * samples]` concatenations per layout were also added.
This is the cache-blowup scenario the chunking note in step 2 anticipated.
A cache-bounded variant that batches a few contours at a time remains
possible if per-operation overhead grows dominant again.
7. Measured and not kept. The backend reshape is always a copy in Burn 0.21's
NdArray backend: `reshape` goes through ndarray's `to_shape` followed by
`into_shared`, and the borrowed view `to_shape` returns for a
standard-layout array forces `into_shared` to clone the buffer, so every
`reshape` and `unsqueeze` allocates and copies regardless of buffer
uniqueness. Rebuilding the winding path so sample-sized tensors are born
with the root axis through broadcasting (lifting only sample rows and
per-segment columns) and hoisting the sample-row reshape out of the
contour loop cut the reshape share from 5.38% to 3.57%, but the
replacements cost the same: an extra segment-axis summation pass raised
`sum_dim` from 1.41% to 2.41%, and the lifted rows still copy. The
evaluate benchmark was inert at 15.581s ± 5.2ms against 15.587s ± 8.9ms,
so the change was left out, following the chunking precedent in step 2.
Its offset-free subset, hoisting the sample-row reshape out of the contour
loop, was later kept as completed item 18.
8. Measured and not kept, twice over: the winding pass audit. Folding the
crossing signs and the segment summation into a matrix product
(`sign` transposed against the float validity, bit-identical because the
contributions are exact zero and unit terms) regressed the evaluate
benchmark from 12.956s ± 10.0ms to 13.018s ± 35.6ms in a paired
comparison: a single-row product is a degenerate matmul that the kernel
handles worse than the two streaming sweeps it replaces. Replacing the two
parameter range comparisons with one clamp and one equality
(`clamp(t, 0, 1) == t`) regressed it from 12.964s ± 5.6ms to
14.411s ± 7.4ms, about 11%: the scalar comparisons are cheap bool-output
passes, while the clamp materializes a float tensor and the equality reads
two float operands. The audit's conclusion is that the winding pass
structure is at its operation-count floor for this backend: scalar
comparisons, boolean conjunctions, and the masked sum are each already in
their cheapest available form, and algebraic folds that look like fewer
operations cost more as tensors.
Deprioritize quadratic root algebra micro-optimizations (element-proportional
and already tight), segment-level geometric pruning (samples span the whole
image, so nearly every segment overlaps), and any reliance on backend fusion.
Contour-pair pruning in boundary winding is also deprioritized: layouts hold
one glyph in isolation, glyphs carry one to three mostly nested contours, and
the only zero-valued pair — a glyph's outer samples against a counter — is a
small term that needs a bounding-circle test to prove the separation that
bounding boxes cannot, while introducing data-dependent branching into a
uniform pipeline. Revisit only if compositions ever hold multiple glyphs.
Validate each step with `make benchmark BENCHMARK=evaluate` for timing and `make
profile` to confirm the targeted cost class actually dropped, then record the
result in BENCHMARKING.md and refresh this document.
## Direction
The Profile section above is current as of completed item 21; item 22 has
since flattened the segment-gradient scatter indices, and the step 8 audit
closed winding operation count as exhausted for this backend. The remaining
boundary residue is structurally sound, and the local optimization list is
exhausted pending the upstream fixes below.
Two backend findings have fixes in progress upstream, both verified on Burn's
main branch and pinned by an isolated probe: `select` converts the index
tensor through a per-element iterator before the lookup, measuring 14.0
nanoseconds per element against 3.0 for `gather`, addressed by
[pull request 5066][select], and `reshape` copies at memcpy bandwidth even
for uniquely owned standard-layout buffers (step 7), addressed by
[pull request 5067][reshape]. Once those land in a release, revisit the
gather workaround from item 17, since the fixed `select` measures faster than
`gather`, and the reshape economics behind steps 7 and 8 and item 18. For
boundary work, prefer small local cleanups over a full linear/quadratic
sample split unless a benchmark demonstrates that the split pays for its
selection overhead.
[reshape]: https://github.com/tracel-ai/burn/pull/5067
[samply]: https://github.com/mstange/samply
[select]: https://github.com/tracel-ai/burn/pull/5066