# HFT Benchmarks
High-precision performance measurement tools for Rust applications requiring nanosecond-level timing accuracy.
## Quick Start
Add to your `Cargo.toml`:

```toml
[dependencies]
hft-benchmarks = { path = "../path/to/hft-benchmarks" }
```
Simple benchmark (argument shapes are illustrative):

```rust
use hft_benchmarks::*;

quick_calibrate_tsc_frequency();
SimpleBench::new("my_function")
    .bench(1000, || my_function())
    .report();
```

Output:

```text
my_function: 1000 samples, mean=245ns, p50=230ns, p95=310ns, p99=450ns, p99.9=890ns, std_dev=45.2ns
```
## Usage Examples
### Basic Timing
```rust
use hft_benchmarks::*;

// One-time setup (do this once at program start)
calibrate_tsc_frequency();

// Time a single operation (return shape illustrative)
let (result, elapsed_ns) = time_function(|| expensive_operation());
println!("took {}ns", elapsed_ns);
```
### Statistical Analysis
```rust
use hft_benchmarks::*;

// Collect multiple measurements for statistical analysis
let mut results = BenchmarkResults::new("my_function");
for _ in 0..1000 {
    let timer = PrecisionTimer::start();
    my_function();
    results.record(timer.stop()); // recording method name illustrative
}

let analysis = results.analyze();
println!("{}", analysis);

// Check if performance meets requirements (target in ns, illustrative)
if analysis.meets_target(500) {
    println!("within target");
} else {
    println!("exceeds target");
}
```
### Comparing Implementations

```rust
use hft_benchmarks::*;

// Benchmark both versions, then compare tail latency (sketch; the
// exact comparison helper, if any, may differ)
let old = SimpleBench::new("old_impl").bench(10_000, || old_impl()).analyze();
let new = SimpleBench::new("new_impl").bench(10_000, || new_impl()).analyze();
println!("Old: {}ns P99, New: {}ns P99", old.p99, new.p99);
```
### Memory Allocation Benchmarks

```rust
use hft_benchmarks::*;

// Benchmark heap allocations of several sizes (setup illustrative)
for size in [64usize, 1024] {
    SimpleBench::new(&format!("allocation_{}B", size))
        .bench(10_000, || {
            std::hint::black_box(Vec::<u8>::with_capacity(size));
        })
        .report();
}
```
Example output:

```text
Benchmarking memory allocations (10000 iterations per size)...
allocation_64B: 10000 samples, mean=89ns, p50=70ns, p95=120ns, p99=180ns
allocation_1024B: 10000 samples, mean=145ns, p50=130ns, p95=200ns, p99=280ns

Pool allocation:   pool_allocation: 10000 samples, mean=65ns, p50=60ns, p95=85ns, p99=110ns
Direct allocation: direct_allocation: 10000 samples, mean=140ns, p50=130ns, p95=180ns, p99=220ns
```
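The pool-versus-direct gap shown above can be reproduced in plain Rust. This is a minimal sketch, not this crate's pool implementation: a `Vec`-backed free list (`Pool` is a hypothetical name) measured against fresh `Box` allocations with `std::time::Instant` standing in for the TSC timer.

```rust
use std::time::Instant;

/// Minimal object pool: recycles boxed buffers instead of reallocating.
struct Pool {
    free: Vec<Box<[u8; 64]>>,
}

impl Pool {
    fn new() -> Self {
        Pool { free: Vec::new() }
    }
    fn get(&mut self) -> Box<[u8; 64]> {
        // Reuse a recycled buffer if available, otherwise allocate.
        self.free.pop().unwrap_or_else(|| Box::new([0u8; 64]))
    }
    fn put(&mut self, buf: Box<[u8; 64]>) {
        self.free.push(buf);
    }
}

/// Average ns per fresh heap allocation.
fn direct_ns(iters: u128) -> u128 {
    let t = Instant::now();
    for _ in 0..iters {
        std::hint::black_box(Box::new([0u8; 64]));
    }
    t.elapsed().as_nanos() / iters
}

/// Average ns per pooled get/put (a Vec pop and push after warm-up).
fn pool_ns(iters: u128) -> u128 {
    let mut pool = Pool::new();
    let t = Instant::now();
    for _ in 0..iters {
        let buf = pool.get();
        pool.put(std::hint::black_box(buf));
    }
    t.elapsed().as_nanos() / iters
}

fn main() {
    let iters = 10_000;
    println!(
        "direct: {}ns/iter, pool: {}ns/iter",
        direct_ns(iters),
        pool_ns(iters)
    );
}
```

The pooled path never touches the allocator after the first iteration, which is why its p50 and tail percentiles sit well below direct allocation.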
## API Reference
### Setup and Calibration
```rust
// Required once at program startup for accurate timing
calibrate_tsc_frequency();       // 1000ms calibration (most accurate)
quick_calibrate_tsc_frequency(); // 100ms calibration (faster, less accurate)
```
### SimpleBench (Recommended)
Fluent API for quick benchmarking:
```rust
use hft_benchmarks::SimpleBench;

// Argument shapes are illustrative
SimpleBench::new("my_function")
    .bench(1000, || my_function())
    .report(); // Print results

// Or get the analysis object
let analysis = SimpleBench::new("my_function")
    .bench(1000, || my_function())
    .analyze();
```
### Manual Timing
For custom measurement logic:
```rust
use hft_benchmarks::{time_function, PrecisionTimer};

// Time a single operation
let timer = PrecisionTimer::start();
expensive_operation();
let elapsed_ns = timer.stop();

// Time a function while keeping its return value
let (result, elapsed_ns) = time_function(|| expensive_operation());
```
### Statistical Analysis
```rust
use hft_benchmarks::BenchmarkResults;

let mut results = BenchmarkResults::new("my_function");

// Collect measurements (recording method name illustrative)
for _ in 0..1000 {
    let timer = PrecisionTimer::start();
    my_function();
    results.record(timer.stop());
}

// Analyze results
let analysis = results.analyze();
println!("{}", analysis);

// Check performance target (ns value illustrative)
if analysis.meets_target(500) {
    println!("meets target");
}
```
## Understanding Results
The benchmark results show the statistical distribution of the timing measurements:

```text
function_name: 1000 samples, mean=245ns, p50=230ns, p95=310ns, p99=450ns, p99.9=890ns, std_dev=45.2ns
```
- mean: Average execution time
- p50 (median): 50% of operations complete faster than this
- p95: 95% of operations complete faster than this
- p99: 99% of operations complete faster than this (critical for tail latency)
- p99.9: 99.9% of operations complete faster than this
- std_dev: Standard deviation (consistency indicator)
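These percentiles are plain order statistics over the sorted samples. A minimal sketch of how they can be derived (nearest-rank method; this crate's exact interpolation may differ), including a single outlier to show how the mean and p99.9 move while p50 barely does:

```rust
/// Nearest-rank percentile over an ascending-sorted slice of ns samples.
fn percentile(sorted: &[u64], p: f64) -> u64 {
    // Smallest sample at or below which at least p% of samples fall.
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    sorted[rank.saturating_sub(1).min(sorted.len() - 1)]
}

fn main() {
    // 100 well-behaved samples (1..=100 ns) plus one 10_000ns outlier.
    let mut samples: Vec<u64> = (1..=100).collect();
    samples.push(10_000);
    samples.sort_unstable();

    let mean = samples.iter().sum::<u64>() / samples.len() as u64;
    println!("mean={}ns", mean);                        // 149 — dragged up by the outlier
    println!("p50={}ns", percentile(&samples, 50.0));   // 51  — unaffected
    println!("p99={}ns", percentile(&samples, 99.0));   // 100 — still unaffected
    println!("p99.9={}ns", percentile(&samples, 99.9)); // 10000 — catches the outlier
}
```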
### Why P99 Matters
In performance-critical systems:
- Mean can hide outliers that hurt user experience
- P99 shows worst-case performance for 99% of operations
- P99.9 reveals extreme outliers that can cause system issues
Example: A function averaging 100ns but with P99 of 10ms will cause problems despite good average performance.
## Running Tests
Run the benchmark test suite:

```bash
# From the project root (or from the benchmark crate directory)
cargo test
```

Run the example benchmarks:

```bash
cargo run --example simple_benchmark_example
```
## Best Practices
### 1. Calibration
Always calibrate before benchmarking:
```rust
// At program start
quick_calibrate_tsc_frequency(); // For development/testing
// OR
calibrate_tsc_frequency();       // For production measurements
```
### 2. Sample Size
Use a sample size appropriate to the decision being made (counts illustrative):

```rust
// Quick development check
SimpleBench::new("f").bench(100, || f()).report();

// Production validation
SimpleBench::new("f").bench(100_000, || f()).report();
```
### 3. Warm-up

Account for cache and branch-predictor warming before measuring:

```rust
// Warm up
for _ in 0..1000 {
    std::hint::black_box(my_function());
}

// Then benchmark
SimpleBench::new("my_function").bench(10_000, || my_function()).report();
```
### 4. System Considerations
- Run on isolated CPU cores for consistent results
- Disable CPU scaling for accurate measurements
- Minimize background processes during benchmarking
- Use release-mode builds (`cargo run --release`)
## Common Use Cases
### 1. Development - Quick Performance Check

```rust
use hft_benchmarks::*;

// Fast sanity check while iterating (sketch)
quick_calibrate_tsc_frequency();
SimpleBench::new("hot_path").bench(100, || hot_path()).report();
```
### 2. Optimization - Algorithm Comparison

```rust
use hft_benchmarks::*;

// Compare two candidate algorithms on the same workload (sketch)
let a = SimpleBench::new("algo_a").bench(10_000, || algo_a()).analyze();
let b = SimpleBench::new("algo_b").bench(10_000, || algo_b()).analyze();
println!("algo_a p99={}ns, algo_b p99={}ns", a.p99, b.p99);
```
### 3. Production - Performance Validation

```rust
use hft_benchmarks::*;

// Gate a release on a latency budget (target in ns, illustrative)
let analysis = SimpleBench::new("hot_path").bench(100_000, || hot_path()).analyze();
assert!(analysis.meets_target(500), "latency budget exceeded");
```
### 4. Memory Optimization

```rust
use hft_benchmarks::*;

// Check what the hot path pays in allocation costs (sketch)
SimpleBench::new("alloc_64B")
    .bench(10_000, || std::hint::black_box(Vec::<u8>::with_capacity(64)))
    .report();
```
## Running Complete Benchmark Suite
### Memory Allocation Analysis
```bash
cargo run --example simple_benchmark_example
```

Output:

```text
=== Vector Allocation Benchmark ===
vec_allocation: 1000 samples, mean=185ns, p50=170ns, p95=220ns, p99=992ns

=== Implementation Comparison ===
Old: 90ns P99, New: 50ns P99
Improvement: 166.7% faster
```
### Custom Benchmarks

```rust
use hft_benchmarks::*;

// Build your own measurement loop when the fluent API is too coarse (sketch)
let mut results = BenchmarkResults::new("custom");
for input in inputs {
    let timer = PrecisionTimer::start();
    process(input);
    results.record(timer.stop()); // recording method name illustrative
}
results.analyze();
```
## Technical Details
### Precision and Accuracy
This library uses CPU timestamp counters (TSC) for nanosecond-precision timing:
- TSC-based timing: direct CPU cycle counting via the `_rdtsc()` instruction
- Memory barriers: prevent instruction reordering that could affect measurements
- Calibrated conversion: converts CPU cycles to nanoseconds based on the measured TSC frequency
- Minimal overhead: ~35ns per measurement
### Measurement Overhead
The benchmark tools themselves have minimal impact:
- `PrecisionTimer` overhead: ~35ns
- Function call overhead: ~37ns
- Statistical calculation: <1μs for 10k samples
- Memory allocation test: ~100-500ns per iteration
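You can estimate a timer's own overhead the same way such figures are usually obtained: time a tight loop of back-to-back clock reads and divide. A sketch using `std::time::Instant` as a stand-in for the TSC timer:

```rust
use std::time::Instant;

/// Estimate per-call overhead of Instant::now() by timing back-to-back reads.
fn timer_overhead_ns(iters: u32) -> u128 {
    let start = Instant::now();
    for _ in 0..iters {
        // black_box keeps the optimizer from removing the unused reads.
        std::hint::black_box(Instant::now());
    }
    start.elapsed().as_nanos() / iters as u128
}

fn main() {
    println!("Instant::now overhead: ~{}ns", timer_overhead_ns(100_000));
}
```

Subtract this per-read cost from measurements of very short operations, or the overhead will dominate the result.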
### System Requirements
- x86_64 or ARM CPU with a stable timestamp counter (most modern processors); note that the x86 TSC instruction is not available on aarch64
- Linux, macOS, or Windows
- Rust 1.70+
### Limitations
- CPU frequency scaling can affect accuracy (disable for best results)
- System load impacts measurement consistency
- Compiler optimizations may eliminate benchmarked code (use `std::hint::black_box`)
- First-run variance due to cache and branch-predictor warming
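The `black_box` point deserves a concrete sketch: without it, release builds may constant-fold or dead-code-eliminate a computation whose result is never used, so the loop measures nothing. `expensive` here is a hypothetical workload, not part of this crate:

```rust
use std::hint::black_box;
use std::time::Instant;

/// A stand-in workload the optimizer could otherwise fold away.
fn expensive(n: u64) -> u64 {
    (0..n).fold(0, |acc, x| acc ^ x.wrapping_mul(31))
}

fn main() {
    let t = Instant::now();
    for _ in 0..1_000 {
        // black_box on both the input and the output keeps the call
        // from being constant-folded or eliminated in release builds.
        black_box(expensive(black_box(1_000)));
    }
    println!("1000 calls took {:?}", t.elapsed());
}
```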
## Integration with Other Tools
Use alongside other profiling tools for comprehensive analysis:
- `perf` for hardware counter analysis
- `valgrind` for memory profiling
- `flamegraph` for call stack visualization
- `criterion` for statistical benchmarking
This library excels at microbenchmarks and latency-critical code paths where nanosecond precision matters.