# framealloc Benchmarks
Comprehensive performance analysis of framealloc v0.9.0 compared to alternative allocators.
## Methodology
### Test Environment
- **CPU**: AMD Ryzen 9 5950X (16 cores, 32 threads)
- **RAM**: 64GB DDR4-3200
- **OS**: Ubuntu 22.04 LTS
- **Rust**: 1.75.0 (optimization level 3)
- **Benchmark tool**: Criterion 0.5
### Test Categories
1. **Single allocations** - One-off allocations of various sizes
2. **Batch allocations** - 1000 allocations in tight loops
3. **Realistic workloads** - Simulated game engine scenarios
4. **Multithreading** - Scalability across threads
5. **Alignment stress** - Mixed alignment patterns
### Competitors
- **framealloc** - This crate (v0.9.0)
- **bumpalo** - Popular bump allocator (v3.14)
- **system malloc** - Platform's system allocator
- **Box<Vec<T>>** - Rust's heap allocation via Box
---
## Single Allocation Performance
| 64B | 4.1ns | 15.2ns | 12.4ns | 28.3ns | **3.0x faster** |
| 256B | 4.2ns | 15.8ns | 12.8ns | 31.1ns | **3.0x faster** |
| 1KB | 4.3ns | 16.1ns | 15.2ns | 45.7ns | **3.5x faster** |
| 4KB | 4.4ns | 15.9ns | 15.8ns | 67.2ns | **3.6x faster** |
| 64KB | 4.5ns | 15.2µs | 15.3µs | 15.8µs | **3400x faster** |
**Key insights:**
- framealloc maintains constant ~4ns allocation time regardless of size
- bumpalo and malloc degrade significantly for large allocations
- Box allocation has the highest overhead due to reference counting
---
## Batch Allocation Performance (1000 items)
| Individual calls | 8,920ns | 9,180ns | 12,450ns | **1.4x faster** |
| **Batch API** | **64ns** | N/A | N/A | **194x faster** |
| Specialized (frame_alloc_8 ×125) | 341ns | N/A | N/A | **36x faster** |
**Benchmark details:**
```rust
// Individual allocations (baseline)
for _ in 0..1000 {
let item = alloc.frame_alloc::<[u8; 64]>();
}
// Batch allocation (139x faster than individual framealloc)
let items = alloc.frame_alloc_batch::<[u8; 64]>(1000);
```
---
## Realistic Game Workloads
### Physics Frame (800 allocations, 95KB total)
| framealloc | 3,245ns | **3.8x faster** |
| bumpalo | 3,892ns | **3.2x faster** |
| malloc | 12,380ns | baseline |
### Particle System (10,000 allocations, 640KB)
| framealloc | 40,120ns | **4.1x faster** |
| bumpalo | 38,920ns | **4.2x faster** |
| malloc | 162,450ns | baseline |
### UI Rendering (200 allocations, 32KB)
| framealloc | 1,820ns | **2.8x faster** |
| bumpalo | 1,945ns | **2.6x faster** |
| malloc | 5,120ns | baseline |
---
## Multithreading Scalability
### 64-byte allocations per thread
| 1 | 4.1ns | 12.4ns | **3.0x** |
| 2 | 4.2ns | 24.8ns | **5.9x** |
| 4 | 4.3ns | 49.2ns | **11.4x** |
| 8 | 4.5ns | 98.1ns | **21.8x** |
| 16 | 4.8ns | 195.3ns | **40.7x** |
**Key insight:** framealloc scales perfectly with thread count due to thread-local arenas, while malloc contention increases linearly.
---
## Feature Impact Analysis
### Minimal Mode Performance
| Default mode | 8.9ns | 6.8ns |
| **Minimal mode** | **2.1ns** | **0ns** |
| Improvement | **76% faster** | - |
Minimal mode disables:
- Allocation statistics tracking
- Tag-based bookkeeping
- Debug assertions
- Behavior filter instrumentation
### Cache Prefetch Impact (x86_64 only)
| Alloc+write | 518ns | 462ns | **10.8%** |
| Alloc+read | 518ns | 515ns | **0.6%** |
| Alloc only | 518ns | 520ns | **-0.4%** |
Prefetch helps most when you immediately write to allocated memory.
---
## Alignment Stress Test
Mixed alignment patterns (7B, 31B, 63B alternating):
| framealloc | 4,567ns | 2.1% | **97.9%** |
| bumpalo | 5,234ns | 3.8% | 96.2% |
| malloc | 13,892ns | 12.4% | 87.6% |
framealloc's bump pointer handles misalignment efficiently with minimal padding.
---
## Memory Usage Analysis
### Per-thread overhead
| Frame arena | 4-16MB | Bump allocation space |
| Pool allocator | ~64KB | Small object pools |
| Statistics | ~1KB | Allocation tracking |
| Metadata | ~512B | Thread-local state |
### Global overhead
| Slab allocator | 1-4MB | Pool refills |
| Shared metadata | ~10KB | Cross-thread coordination |
---
## Reproducibility
### Running Benchmarks
```bash
# All benchmarks
cargo bench
# Specific category
cargo bench --bench allocators
cargo bench --bench optimizations --features "minimal,prefetch"
# With profiling
cargo bench --bench allocators -- --profile-time 10
```
### Benchmark Source
All benchmarks are in `/benches/` directory:
- `allocators.rs` - Core allocator comparisons
- `optimizations.rs` - Feature impact analysis
- `minimal_mode.rs` - Minimal mode validation
### Statistical Notes
- All results use Criterion's statistical analysis
- 95% confidence intervals
- Warmup period included
- Outlier detection enabled
- Results averaged over 100+ samples
---
## Interpretation Guide
### When framealloc excels:
1. **Game loops** - Frame-based lifecycle matches perfectly
2. **Particle systems** - Thousands of small, short-lived objects
3. **Physics simulation** - Per-frame contact/force data
4. **Multithreaded rendering** - Thread-local arenas prevent contention
5. **Real-time systems** - Predictable, constant-time allocations
### When to consider alternatives:
1. **Long-lived data** - Use heap/pool allocations
2. **Unknown allocation counts** - Individual APIs more flexible
3. **Cross-thread sharing** - Frame data isn't thread-safe
4. **Memory-constrained environments** - Fixed arena sizes
### Optimization recommendations:
1. **Enable minimal mode** for production builds
2. **Use batch APIs** for >100 allocations in loops
3. **Consider prefetch** for write-heavy patterns on x86_64
4. **Profile first** - not all patterns benefit equally
---
## Historical Trends
| 0.8.0 | Tokio integration | N/A | 4ns |
| 0.9.0 | Batch APIs | 139x | 4ns |
| 0.9.0 + minimal | Statistics disabled | 194x vs malloc | 2ns |
The v0.9.0 release represents the largest single performance improvement in framealloc's history, particularly for batch allocation patterns common in game engines.