//! Cortex-M0+ embedded inference benchmark.
//!
//! Demonstrates that irithyll-core runs on bare-metal ARM (no_std, no alloc,
//! no FPU), benchmarking both f32 packed and i16 quantized inference on a
//! Cortex-M0+.
//!
//! # Build (from irithyll-core/)
//!
//! ```bash
//! cargo build --features embedded-bench --target thumbv6m-none-eabi --release --example cortex_m_bench
//! ```
//!
//! # Run under QEMU
//!
//! ```bash
//! qemu-system-arm -cpu cortex-m0 -machine lm3s6965evb -nographic \
//!     -semihosting-config enable=on,target=native \
//!     -kernel target/thumbv6m-none-eabi/release/examples/cortex_m_bench
//! ```
//!
//! # TODO: Extended embedded benchmarks
//!
//! The following benchmarks are planned but require cross-compilation setup
//! (thumbv6m-none-eabi / thumbv7em-none-eabihf toolchain + QEMU ARM).
//!
//! ## Model size sweep (tree ensembles)
//!
//! Benchmark packed inference latency at three ensemble sizes:
//! - Small: 10 trees, depth 3 (bench_model_10t.bin)
//! - Medium: 50 trees, depth 4 (bench_model_50t.bin, current)
//! - Large: 100 trees, depth 5 (bench_model_100t.bin)
//!
//! Expected: latency scales ~linearly with n_trees * avg_depth.
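//! For instance, relative cost ≈ n_trees * avg_depth: 10*3 = 30, 50*4 = 200,
//! 100*5 = 500, so the large ensemble should take roughly 17x the small one
//! per prediction.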
//! Pack new test_data binaries with `export_packed` + `export_packed_i16`.
//!
//! ## Step-count throughput (10, 50, 100 prediction steps)
//!
//! Loop the prediction call N times and report total cycles / N:
//! - 10 steps: single-sample latency baseline
//! - 50 steps: representative online inference burst
//! - 100 steps: sustained throughput with instruction cache warmed up
//!
//! Use `cortex_m::asm::nop()` fences between loops to prevent DCE.
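//!
//! A sketch of that loop (hypothetical names: `predict_packed` stands in for
//! the real inference call, `n_steps`/`input` for the benchmark inputs;
//! SysTick setup and 24-bit wrap handling are omitted):
//!
//! ```rust,ignore
//! let t0 = cortex_m::peripheral::SYST::get_current();
//! for _ in 0..n_steps {
//!     let y = predict_packed(&input);
//!     core::hint::black_box(y); // keep the result live
//!     cortex_m::asm::nop();     // fence between iterations, prevents DCE
//! }
//! // SysTick counts down, so elapsed = start - end
//! let elapsed = t0.wrapping_sub(cortex_m::peripheral::SYST::get_current());
//! hprintln!("cycles/step: {}", elapsed / n_steps);
//! ```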
//!
//! ## Cortex-M4 FPU comparison (thumbv7em-none-eabihf)
//!
//! Re-run f32 packed inference on M4 with hardware FPU enabled.
//! Expected: ~2-4x speedup on tree walks (hardware `vmul.f32` / `vfma.f32`
//! replace soft-float calls, and the FPU pipelines multiply-accumulates).
//! Use qemu-system-arm with `-cpu cortex-m4` and the lm3s6965evb machine.
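//!
//! Mirroring the M0+ commands above, only the target triple and `-cpu` change:
//!
//! ```bash
//! cargo build --features embedded-bench --target thumbv7em-none-eabihf --release --example cortex_m_bench
//! qemu-system-arm -cpu cortex-m4 -machine lm3s6965evb -nographic \
//!     -semihosting-config enable=on,target=native \
//!     -kernel target/thumbv7em-none-eabihf/release/examples/cortex_m_bench
//! ```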
//!
//! ## TurboQuant weight inference on M0+
//!
//! Benchmark `TurboQuantizedView::predict_with_scratch` on ARM:
//! - 64-weight vector (d_model=64 RLS readout), 3.5-bit mode
//! - Measure cycles for FWHT rotation + base-11 unpack + dot product
//! - Compare against raw f32 dot product of same vector
//! Requires: static `PACKED_WEIGHTS: &[u8]` embedded via `include_bytes!`.
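//!
//! A minimal sketch of that embedding (the file name is hypothetical; pack
//! the weights on the host first):
//!
//! ```rust,ignore
//! // Hypothetical path under test_data/; produced by the host-side exporter.
//! static PACKED_WEIGHTS: &[u8] = include_bytes!("../test_data/turboquant_w64.bin");
//! ```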
#![no_std]
#![no_main]

#[cfg(target_arch = "arm")]
use cortex_m_rt::entry;
#[cfg(target_arch = "arm")]
use cortex_m_semihosting::hprintln;
use panic_halt as _;
// Stub for non-ARM hosts (x86 CI, cargo check --all-targets).