hft-benchmarks 0.1.0

High-precision benchmarking tools for high-frequency trading systems with nanosecond-level timing accuracy
# HFT Benchmarks

High-precision performance measurement tools for Rust applications requiring nanosecond-level timing accuracy.

## Quick Start

Add to your `Cargo.toml`:
```toml
[dependencies]
hft-benchmarks = { path = "../path/to/hft-benchmarks" }
```

Simple benchmark:
```rust
use hft_benchmarks::*;

fn main() {
    quick_calibrate_tsc_frequency();
    
    SimpleBench::new("my_function")
        .bench(1000, || my_expensive_function())
        .report();
}
```

Output:
```
my_function: 1000 samples, mean=245ns, p50=230ns, p95=310ns, p99=450ns, p99.9=890ns, std_dev=45.2ns
```

## Usage Examples

### Basic Timing

```rust
use hft_benchmarks::*;

// One-time setup (do this once at program start)
calibrate_tsc_frequency();

// Time a single operation
let (result, elapsed_ns) = time_function(|| {
    expensive_computation()
});
println!("Operation took {}ns", elapsed_ns);
```

### Statistical Analysis

```rust
// Collect multiple measurements for statistical analysis
let mut results = BenchmarkResults::new("algorithm_comparison".to_string());

for _ in 0..1000 {
    let timer = PrecisionTimer::start();
    your_algorithm();
    results.record(timer.stop());
}

let analysis = results.analyze();
println!("{}", analysis.summary());

// Check if performance meets requirements
if analysis.meets_target(100) {  // P99 < 100ns
    println!("✓ Performance target met");
} else {
    println!("✗ Too slow: P99 = {}ns", analysis.p99);
}
```

### Comparing Implementations

```rust
use hft_benchmarks::*;

fn main() {
    quick_calibrate_tsc_frequency();
    
    // Benchmark old implementation
    let old_perf = SimpleBench::new("old_algorithm")
        .bench(5000, || old_implementation())
        .analyze();
    
    // Benchmark new implementation
    let new_perf = SimpleBench::new("new_algorithm")
        .bench(5000, || new_implementation())
        .analyze();
    
    // Calculate improvement
    let speedup = old_perf.mean as f64 / new_perf.mean as f64;
    println!("New implementation is {:.1}x faster", speedup);
    println!("Old: {}ns P99, New: {}ns P99", old_perf.p99, new_perf.p99);
}
```

### Memory Allocation Benchmarks

```rust
use hft_benchmarks::*;

fn main() {
    quick_calibrate_tsc_frequency();
    
    // Run built-in allocation benchmarks
    benchmark_allocations();           // Test different allocation sizes
    benchmark_object_pools();          // Compare pool vs direct allocation
    benchmark_aligned_allocations();   // Test memory alignment impact
}
```

Example output:
```
Benchmarking memory allocations (10000 iterations per size)...
allocation_64B: 10000 samples, mean=89ns, p50=70ns, p95=120ns, p99=180ns
allocation_1024B: 10000 samples, mean=145ns, p50=130ns, p95=200ns, p99=280ns

Pool allocation: pool_allocation: 10000 samples, mean=65ns, p50=60ns, p95=85ns, p99=110ns
Direct allocation: direct_allocation: 10000 samples, mean=140ns, p50=130ns, p95=180ns, p99=220ns
```

## API Reference

### Setup and Calibration

```rust
// Required once at program startup for accurate timing
calibrate_tsc_frequency();        // 1000ms calibration (most accurate)
quick_calibrate_tsc_frequency();  // 100ms calibration (faster, less accurate)
```

### SimpleBench (Recommended)

Fluent API for quick benchmarking:

```rust
use hft_benchmarks::SimpleBench;

SimpleBench::new("operation_name")
    .bench(iterations, || your_function())
    .report();                     // Print results
    
// Or get analysis object
let analysis = SimpleBench::new("operation_name")
    .bench(iterations, || your_function())
    .analyze();
```

### Manual Timing

For custom measurement logic:

```rust
use hft_benchmarks::{PrecisionTimer, time_function};

// Time a single operation
let timer = PrecisionTimer::start();
expensive_operation();
let elapsed_ns = timer.stop();

// Time function with return value
let (result, elapsed_ns) = time_function(|| {
    compute_something()
});
```

### Statistical Analysis

```rust
use hft_benchmarks::BenchmarkResults;

let mut results = BenchmarkResults::new("test_name".to_string());

// Collect measurements
for _ in 0..1000 {
    let elapsed = time_operation();
    results.record(elapsed);
}

// Analyze results
let analysis = results.analyze();
println!("Mean: {}ns, P99: {}ns", analysis.mean, analysis.p99);

// Check performance target
if analysis.meets_target(500) {  // P99 < 500ns
    println!("Performance target met!");
}
```

## Understanding Results

The benchmark results show statistical distribution of timing measurements:

```
function_name: 1000 samples, mean=245ns, p50=230ns, p95=310ns, p99=450ns, p99.9=890ns, std_dev=45.2ns
```

- **mean**: Average execution time
- **p50** (median): 50% of operations complete faster than this
- **p95**: 95% of operations complete faster than this
- **p99**: 99% of operations complete faster than this (critical for tail latency)
- **p99.9**: 99.9% of operations complete faster than this
- **std_dev**: Standard deviation (consistency indicator)
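The percentile figures above are standard nearest-rank statistics. As an illustration only (this is not the crate's actual implementation), they can be computed from raw nanosecond samples like this:

```rust
// Illustrative sketch: nearest-rank percentiles over sorted samples.
// Not the crate's internals; shown to make the summary line concrete.

fn percentile(sorted: &[u64], p: f64) -> u64 {
    // Nearest-rank: index = ceil(p/100 * n) - 1, clamped to bounds.
    let n = sorted.len();
    let rank = ((p / 100.0) * n as f64).ceil() as usize;
    sorted[rank.saturating_sub(1).min(n - 1)]
}

fn main() {
    // Synthetic samples: 1ns..=1000ns, already uniform for clarity.
    let mut samples: Vec<u64> = (1..=1000).collect();
    samples.sort_unstable();
    println!(
        "p50={}ns p95={}ns p99={}ns",
        percentile(&samples, 50.0),
        percentile(&samples, 95.0),
        percentile(&samples, 99.0)
    );
    // → p50=500ns p95=950ns p99=990ns
}
```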

### Why P99 Matters

In performance-critical systems:
- **Mean** can hide outliers that hurt user experience
- **P99** shows worst-case performance for 99% of operations
- **P99.9** reveals extreme outliers that can cause system issues

Example: A function with a 100ns median can still hide rare 50µs stalls. The mean barely moves, but the P99.9 (and anything downstream of the stall) will notice. Note that mean and tail constrain each other: a mean of 100ns is incompatible with a P99 of 10ms, since 1% of samples at 10ms alone would push the mean above 100µs.
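A tiny numeric illustration of this point, using made-up sample values: two slow outliers among a thousand samples barely move the mean but dominate the extreme tail.

```rust
fn main() {
    // 998 fast samples at 100ns plus two 50µs stalls.
    let mut samples = vec![100u64; 998];
    samples.extend([50_000u64, 50_000]);

    // Mean stays under 200ns...
    let mean = samples.iter().sum::<u64>() / samples.len() as u64;

    // ...while the nearest-rank p99.9 exposes the 50_000ns outliers.
    samples.sort_unstable();
    let p999 = samples[998]; // ceil(0.999 * 1000) - 1 = 998

    println!("mean={}ns p99.9={}ns", mean, p999);
    // → mean=199ns p99.9=50000ns
}
```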

## Running Tests

Run the benchmark test suite:

```bash
# From project root
cd /path/to/hft-framework/Code
cargo test --package hft-benchmarks -- --nocapture

# Or from benchmark crate directory
cd crates/hft-benchmarks
cargo test --lib -- --nocapture
```

Run example benchmarks:
```bash
cargo run --example simple_benchmark_example
```

## Best Practices

### 1. Calibration
Always calibrate before benchmarking:
```rust
// At program start
quick_calibrate_tsc_frequency();  // For development/testing
// OR
calibrate_tsc_frequency();        // For production measurements
```

### 2. Sample Size
Use appropriate sample sizes:
```rust
// Quick development check
SimpleBench::new("dev_test").bench(100, || function()).report();

// Production validation  
SimpleBench::new("prod_test").bench(10000, || function()).report();
```

### 3. Warm-up
Account for JIT compilation and cache warming:
```rust
// Warm up
for _ in 0..1000 { function(); }

// Then benchmark
SimpleBench::new("warmed_up").bench(5000, || function()).report();
```

### 4. System Considerations
- Run on isolated CPU cores for consistent results
- Disable CPU scaling for accurate measurements
- Minimize background processes during benchmarking
- Use release mode builds (`cargo run --release`)
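On Linux, the first two points might look like the following. These commands are a sketch: `cpupower` and `taskset` must be installed separately, governor support differs across CPUs, and the core number is machine-specific.

```shell
# Example Linux tuning (assumes cpupower and taskset are available;
# core numbering and governor names vary by system)

# Force the performance governor so the clock doesn't scale mid-run
sudo cpupower frequency-set -g performance

# Pin the benchmark process to a single core (core 3 here)
taskset -c 3 cargo run --release --example simple_benchmark_example
```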

## Common Use Cases

### 1. Development - Quick Performance Check

```rust
use hft_benchmarks::*;

fn main() {
    quick_calibrate_tsc_frequency();
    
    SimpleBench::new("new_feature")
        .bench(1000, || my_new_function())
        .report();
}
```

### 2. Optimization - Algorithm Comparison

```rust
use hft_benchmarks::*;

fn compare_algorithms() {
    quick_calibrate_tsc_frequency();
    
    println!("=== Algorithm Comparison ===");
    
    let results_a = SimpleBench::new("algorithm_a")
        .bench(5000, || algorithm_a())
        .analyze();
        
    let results_b = SimpleBench::new("algorithm_b")
        .bench(5000, || algorithm_b())
        .analyze();
    
    println!("Algorithm A: {}ns P99", results_a.p99);
    println!("Algorithm B: {}ns P99", results_b.p99);
    
    if results_b.p99 < results_a.p99 {
        let improvement = (results_a.p99 as f64 / results_b.p99 as f64 - 1.0) * 100.0;
        println!("Algorithm B is {:.1}% faster (P99)", improvement);
    }
}
```

### 3. Production - Performance Validation

```rust
use hft_benchmarks::*;

fn validate_performance() {
    calibrate_tsc_frequency();  // Full calibration for accuracy
    
    let analysis = SimpleBench::new("critical_path")
        .bench(10000, || critical_trading_function())
        .analyze();
    
    // Ensure P99 latency meets requirements
    const MAX_P99_NS: u64 = 500;
    assert!(
        analysis.meets_target(MAX_P99_NS),
        "Performance regression: P99 = {}ns (max allowed: {}ns)",
        analysis.p99,
        MAX_P99_NS
    );
    
    println!("✓ Performance validation passed");
    println!("  Mean: {}ns, P99: {}ns, P99.9: {}ns", 
             analysis.mean, analysis.p99, analysis.p999);
}
```

### 4. Memory Optimization

```rust
use hft_benchmarks::*;

fn optimize_memory_usage() {
    quick_calibrate_tsc_frequency();
    
    println!("=== Memory Allocation Comparison ===");
    
    // Test stack allocation
    SimpleBench::new("stack_alloc")
        .bench(10000, || {
            let data = [0u64; 64];  // Stack allocated
            std::hint::black_box(data);
        })
        .report();
    
    // Test heap allocation  
    SimpleBench::new("heap_alloc")
        .bench(10000, || {
            let data = vec![0u64; 64];  // Heap allocated
            std::hint::black_box(data);
        })
        .report();
    
    // Use built-in memory benchmarks
    benchmark_object_pools();
}
```

## Running Complete Benchmark Suite

### Memory Allocation Analysis

```bash
cargo run --example simple_benchmark_example
```
**Output:**
```
=== Vector Allocation Benchmark ===
vec_allocation: 1000 samples, mean=185ns, p50=170ns, p95=220ns, p99=992ns

=== Implementation Comparison ===
Old: 90ns P99, New: 50ns P99
Improvement: 80.0% faster
```

### Custom Benchmarks

```rust
use hft_benchmarks::*;

fn main() {
    calibrate_tsc_frequency();
    
    // Benchmark your trading algorithm
    SimpleBench::new("order_processing")
        .bench(10000, || process_market_order())
        .report();
        
    // Memory-intensive operations
    benchmark_allocations();
    benchmark_object_pools();
}
```

## Technical Details

### Precision and Accuracy

This library uses CPU timestamp counters (TSC) for nanosecond-precision timing:

- **TSC-based timing**: Direct CPU cycle counting via `_rdtsc()` instruction
- **Memory barriers**: Prevents instruction reordering that could affect measurements
- **Calibrated conversion**: Converts CPU cycles to nanoseconds based on measured frequency
- **Minimal overhead**: ~35ns measurement overhead
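A minimal sketch of the mechanism described above (x86_64 only; the crate's actual implementation may differ): fence both sides of the TSC read, then divide the cycle delta by a calibrated cycles-per-nanosecond rate. The 3.0 cycles/ns figure below is an assumed placeholder; real code measures it at startup, which is what `calibrate_tsc_frequency()` exists for.

```rust
use core::arch::x86_64::{_mm_lfence, _rdtsc};

fn fenced_rdtsc() -> u64 {
    unsafe {
        _mm_lfence(); // keep earlier instructions from sinking past the read
        let tsc = _rdtsc(); // raw cycle count
        _mm_lfence(); // keep later instructions from hoisting above it
        tsc
    }
}

fn main() {
    // Assumed rate: a 3 GHz TSC ticks 3.0 times per nanosecond.
    let cycles_per_ns = 3.0_f64;

    let start = fenced_rdtsc();
    std::hint::black_box((0..10_000u64).sum::<u64>());
    let cycles = fenced_rdtsc() - start;

    println!("~{:.0}ns", cycles as f64 / cycles_per_ns);
}
```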

### Measurement Overhead

The benchmark tools themselves have minimal impact:
```
PrecisionTimer overhead: ~35ns
Function call overhead: ~37ns
Statistical calculation: <1μs for 10k samples
Memory allocation test: ~100-500ns per iteration
```
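You can estimate timestamp overhead on your own machine with back-to-back reads. The sketch below uses `std::time::Instant` so it runs anywhere; substituting the crate's `PrecisionTimer` would measure the TSC path instead.

```rust
use std::hint::black_box;
use std::time::Instant;

// Estimate per-read timestamp overhead: in a loop whose only work is
// the timestamp read itself, total elapsed / iterations ≈ read cost.
fn estimated_overhead_ns(iterations: u32) -> u128 {
    let start = Instant::now();
    for _ in 0..iterations {
        black_box(Instant::now()); // keep the read from being optimized out
    }
    start.elapsed().as_nanos() / iterations as u128
}

fn main() {
    println!("~{}ns per timestamp read", estimated_overhead_ns(100_000));
}
```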

### System Requirements

- **x86_64 CPU** with stable TSC (most modern processors)  
- **Linux, macOS, or Windows**
- **Rust 1.70+**

**Platform Compatibility Note**: This benchmark crate uses x86_64-specific intrinsics (`core::arch::x86_64`) for high-precision timing via CPU timestamp counters. It will not compile on ARM-based systems (including Apple Silicon Macs). For ARM platforms, consider using alternative timing mechanisms or conditional compilation.
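The conditional-compilation approach could look like this hypothetical shim. It is not part of this crate: `raw_ticks` and its fallback are illustrative names, and the non-x86_64 branch trades precision for portability by using the monotonic clock.

```rust
// Hypothetical portability shim (not provided by hft-benchmarks):
// TSC on x86_64, monotonic-clock fallback everywhere else.

#[cfg(target_arch = "x86_64")]
fn raw_ticks() -> u64 {
    // Cycle count; still needs frequency calibration to become ns.
    unsafe { core::arch::x86_64::_rdtsc() }
}

#[cfg(not(target_arch = "x86_64"))]
fn raw_ticks() -> u64 {
    // Coarser but compiles on ARM (e.g. Apple Silicon):
    // nanoseconds since the first call.
    use std::sync::OnceLock;
    use std::time::Instant;
    static EPOCH: OnceLock<Instant> = OnceLock::new();
    EPOCH.get_or_init(Instant::now).elapsed().as_nanos() as u64
}

fn main() {
    let a = raw_ticks();
    let b = raw_ticks();
    println!("two consecutive reads: {} then {}", a, b);
}
```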

### Limitations

- **CPU frequency scaling** can affect accuracy (disable for best results)
- **System load** impacts measurement consistency  
- **Compiler optimizations** may eliminate benchmarked code (use `std::hint::black_box`)
- **First run variance** due to cache warming and JIT compilation
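The `black_box` point is worth making concrete. A minimal pattern, using only `std`:

```rust
use std::hint::black_box;

fn sum_to(n: u64) -> u64 {
    (1..=n).sum()
}

fn main() {
    for _ in 0..1_000 {
        // black_box on the input blocks constant-folding sum_to(10_000)
        // into a precomputed value; black_box on the output keeps the
        // call from being discarded as dead code. Without both, a
        // benchmark can report near-zero times for work that was
        // optimized away entirely.
        black_box(sum_to(black_box(10_000)));
    }
}
```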

### Integration with Other Tools

Use alongside other profiling tools for comprehensive analysis:
- **perf** for hardware counter analysis
- **valgrind** for memory profiling
- **flamegraph** for call stack visualization  
- **criterion** for statistical benchmarking

This library excels at **microbenchmarks** and **latency-critical code paths** where nanosecond precision matters.