map2fig 0.3.0

Fast, publication-quality HEALPix sky map visualization in Rust
# Session Summary: Cairo Batching + Image Pre-rendering Optimizations

**Date**: February 15, 2026  
**Commits**: 3 major commits, cutting PDF render time by 51.4% in total

---

## Accomplishments

### Phase 1: Cairo Call Batching (v0.3.0)
✅ **Completed Successfully**
- Reduced per-pixel Cairo `fill()` calls from 51,456 to ~256
- Implementation: `BatchedCairoImageSink` struct with HashMap color grouping
- **Result: 617ms → 470ms (23.8% improvement)**
- Status: Exceeded the 10-15% target
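
The colour-grouping idea can be sketched as follows. This is a minimal illustration, not the actual `BatchedCairoImageSink`: the `BatchedSink` and `PixelRect` types and the `fill` callback are hypothetical stand-ins, and the Cairo calls are replaced by a closure so the example needs only the standard library.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for one pixel-sized rectangle in a Cairo path.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct PixelRect {
    pub x: u32,
    pub y: u32,
}

/// 8-bit RGBA colour, used as the grouping key.
pub type Rgba = [u8; 4];

/// Sketch of a batching sink: pixels are grouped by colour so that
/// one fill is issued per distinct colour instead of one per pixel.
#[derive(Default)]
pub struct BatchedSink {
    groups: HashMap<Rgba, Vec<PixelRect>>,
}

impl BatchedSink {
    pub fn new() -> Self {
        Self::default()
    }

    /// Record a pixel; nothing is drawn yet.
    pub fn put_pixel(&mut self, x: u32, y: u32, color: Rgba) {
        self.groups.entry(color).or_default().push(PixelRect { x, y });
    }

    /// Call `fill` once per colour group and return the number of calls.
    /// In a real sink, `fill` would add the rectangles to a Cairo path
    /// and issue a single `fill()` for the whole group.
    pub fn flush<F: FnMut(Rgba, &[PixelRect])>(&mut self, mut fill: F) -> usize {
        let mut calls = 0;
        for (color, rects) in self.groups.drain() {
            fill(color, &rects);
            calls += 1;
        }
        calls
    }
}
```

With a 256-entry colormap, the number of fills is bounded by 256 regardless of how many pixels are drawn, which matches the 51,456 to ~256 reduction reported above.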

### Phase 2B: Image Pre-rendering (v0.4.0)
✅ **Completed Successfully - Major Discovery**
- Replaced per-pixel Cairo operations with in-memory image buffer
- Direct memory writes via `PngSink` instead of Cairo path manipulation
- Conversion to Cairo surface with `ImageSurface::create_for_data()`
- **Result: 470ms → 300ms (36.2% improvement)**
- **Total from v0.2.0: 617ms → 300ms (51.4% improvement!)**
- Status: **Far exceeded expectations** (36% vs predicted 10-15%)
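
As an illustration of the "direct memory writes" approach, here is a minimal sketch of an image-buffer sink. `BufferSink` is a hypothetical stand-in; the real `PngSink` is not shown in this document, so its fields and method names may differ.

```rust
/// Hypothetical stand-in for an image-buffer sink such as `PngSink`.
/// Pixels are stored row-major as 8-bit RGBA.
pub struct BufferSink {
    width: usize,
    height: usize,
    data: Vec<u8>,
}

impl BufferSink {
    /// Allocate a fully transparent buffer of `width * height` pixels.
    pub fn new(width: usize, height: usize) -> Self {
        Self { width, height, data: vec![0; width * height * 4] }
    }

    /// Write one pixel with a plain memory copy; out-of-range
    /// coordinates are ignored.
    pub fn put_pixel(&mut self, x: usize, y: usize, rgba: [u8; 4]) {
        if x < self.width && y < self.height {
            let i = (y * self.width + x) * 4;
            self.data[i..i + 4].copy_from_slice(&rgba);
        }
    }

    /// Read a pixel back, or `None` if the coordinates are out of range.
    pub fn pixel(&self, x: usize, y: usize) -> Option<[u8; 4]> {
        if x >= self.width || y >= self.height {
            return None;
        }
        let i = (y * self.width + x) * 4;
        Some([self.data[i], self.data[i + 1], self.data[i + 2], self.data[i + 3]])
    }

    /// Hand the raw bytes over, e.g. for conversion to a Cairo surface.
    pub fn into_raw(self) -> Vec<u8> {
        self.data
    }
}
```

Each pixel costs a bounds check and a four-byte copy, with no path construction, which is the plausible source of the saving over per-pixel Cairo operations.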

### Phase 2A Vectorization: Planning Complete
✅ **Comprehensive Plan Created**
- Analysis of remaining bottlenecks (HEALPix math at 8.64%)
- Step-by-step SIMD vectorization strategy
- `std::simd` (the `portable_simd` feature) with a scalar fallback
- Expected: 300ms → 285ms (additional 5% gain)
- Timeline: 8-12 hours implementation
- Status: Ready for next development session

---

## Performance Progression

| Version | PDF time | Time reduction | Notes |
|---------|----------|----------------|-------|
| v0.2.0  | 617ms | baseline | Cairo per-pixel rendering |
| v0.3.0  | 470ms | 23.8% vs v0.2.0 | Cairo call batching |
| v0.4.0  | 300ms | 36.2% vs v0.3.0 | Image pre-rendering |
| **Total** | **-317ms** | **51.4% vs v0.2.0** | 🎉 Exceeds all targets |
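
The percentages in the table are reductions in wall-clock time, and the per-phase figures compose multiplicatively rather than adding up (23.8% + 36.2% is not 51.4%). The arithmetic can be checked with a small sketch; the function names are illustrative only.

```rust
/// Percentage reduction in time going from `before_ms` to `after_ms`.
pub fn reduction_percent(before_ms: f64, after_ms: f64) -> f64 {
    (before_ms - after_ms) / before_ms * 100.0
}

/// Speedup factor, i.e. how many times faster the new version is.
pub fn speedup_factor(before_ms: f64, after_ms: f64) -> f64 {
    before_ms / after_ms
}

/// Combine successive percentage reductions: each step multiplies
/// the remaining time by (1 - reduction).
pub fn combined_reduction_percent(steps: &[f64]) -> f64 {
    let remaining: f64 = steps.iter().map(|p| 1.0 - p / 100.0).product();
    (1.0 - remaining) * 100.0
}
```

For example, `reduction_percent(617.0, 300.0)` is about 51.4, `combined_reduction_percent(&[23.8, 36.2])` gives the same figure to within rounding, and `speedup_factor(617.0, 300.0)` is about 2.06, so the total is roughly a 2x speedup.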

---

## Why Phase 2B Was So Effective

### Empirical Profiling Guided the Decision

We used `perf record` to identify the true bottleneck:
- **cairo_surface_finish**: 21.71% (PDF encoding/compression)
- **BatchedCairoImageSink::flush()**: 12.73% (250+ fill operations)
- **HEALPix sampling**: 8.64% (sin/cos/atan2 math)

**Decision Logic**:
1. Cairo batching (Phase 1) cut the number of `fill()` calls by 99.5% (51,456 → ~256)
2. But `cairo_surface_finish` still dominated the profile at 21.71%
3. Root cause: building Cairo path structures for PDF encoding
4. Solution: bypass Cairo entirely during rendering and use an image buffer
5. Result: render time fell a further 36.2% (470ms → 300ms)

### Key Insight

> "Sometimes preventing a problem is better than optimizing it."

Instead of making Cairo operations faster, we eliminated them entirely by:
- Rendering directly to fast in-memory image buffer
- Using `PngSink` (same code path as PNG output)
- Converting pre-rendered image to Cairo surface one time
- Single `paint()` operation instead of path manipulation
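
One detail of the one-time conversion is worth sketching. Cairo's `ARGB32` format stores each pixel as a native-endian 32-bit word with premultiplied alpha, whereas an `RgbaImage` holds straight-alpha R, G, B, A bytes. The helper below is a hypothetical illustration of that repacking; whether map2fig performs it this way is not stated here.

```rust
/// Repack straight-alpha RGBA bytes into Cairo's ARGB32 layout:
/// one native-endian u32 per pixel, 0xAARRGGBB, premultiplied alpha.
/// Hypothetical helper; not taken from the map2fig sources.
pub fn rgba_to_cairo_argb32(rgba: &[u8]) -> Vec<u8> {
    assert!(rgba.len() % 4 == 0, "input must be whole RGBA pixels");
    let mut out = Vec::with_capacity(rgba.len());
    for px in rgba.chunks_exact(4) {
        let (r, g, b, a) = (px[0] as u32, px[1] as u32, px[2] as u32, px[3] as u32);
        // Premultiply each channel by alpha, rounding to nearest.
        let pm = |c: u32| (c * a + 127) / 255;
        let word = (a << 24) | (pm(r) << 16) | (pm(g) << 8) | pm(b);
        out.extend_from_slice(&word.to_ne_bytes());
    }
    out
}

// With the cairo-rs crate, the resulting buffer would then typically be
// wrapped once via `ImageSurface::create_for_data(buf, Format::ARgb32,
// width, height, stride)` (stride assumed to be `width * 4` here), set
// as the source surface, and drawn with a single `paint()`.
```

For fully opaque pixels the premultiplication is a no-op, so the conversion reduces to a channel reorder.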

---

## Technical Details

### Files Modified

1. **src/render/pdf.rs** - `blit_raster()` function
   - Replaced BatchedCairoImageSink with RgbaImage + PngSink
   - Uses `ImageSurface::create_for_data()` for conversion
   - Single paint() operation

2. **src/plot/mollweide.rs** - Main PDF plotting
   - Switched from Cairo rendering sink to image buffer
   - Uses `PngSink` for fast pixel writes
   - Converts to Cairo surface after all pixels rendered

3. **docs/PERFORMANCE_TRACKING.md**
   - v0.4.0 results documented with detailed breakdown
   - Phase 2A recommendations included

### Code Patterns Leveraged

- **PngSink**: Already existed, perfect for image buffer writes
- **ImageSurface::create_for_data()**: Cairo's own function for raw pixel data
- **Existing batch loop structure**: Works identically with PngSink

---

## What's Next: Phase 2A

### Remaining Optimization Opportunity

**Current status (v0.4.0)**:
- PDF: 300ms
- Remaining bottlenecks:
  - HEALPix sampling math: 8.64% (sin/cos/atan2)
  - Projection operations: 4.21%
  - I/O and setup: ~20ms (hard limit)

**Phase 2A Target**:
- Vectorize trigonometric operations using portable_simd
- Expected result: 300ms → 285ms (additional 5%)
- Cumulative: ~54% total time reduction from v0.2.0 (617ms → 285ms)
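
A rough sanity check on this target, using Amdahl's law. The sketch assumes the 8.64% profile share for HEALPix math applies to the 300ms run, which this document does not confirm; the function names are illustrative.

```rust
/// Amdahl's law: new total time when a `fraction` of the runtime
/// is accelerated by a factor of `speedup`.
pub fn amdahl_time_ms(total_ms: f64, fraction: f64, speedup: f64) -> f64 {
    total_ms * ((1.0 - fraction) + fraction / speedup)
}

/// Speedup needed on the accelerated fraction to reach `target_ms`,
/// or `None` if the target is below the floor set by the untouched part.
pub fn required_speedup(total_ms: f64, fraction: f64, target_ms: f64) -> Option<f64> {
    let remaining = target_ms / total_ms - (1.0 - fraction);
    if remaining > 0.0 {
        Some(fraction / remaining)
    } else {
        None
    }
}
```

Under that assumption, reaching 285ms requires the trigonometric code to run about 2.4x faster, and even an infinitely fast kernel could not go below roughly 274ms, so the 5% estimate sits near the upper end of what this phase can deliver.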

**Implementation Approach**:
1. Enable `portable_simd` (in current Rust this is the nightly `std::simd` feature gate rather than a Cargo dependency)
2. Vectorize simd_sin_cos_8() with SSE2/AVX2
3. Vectorize simd_atan2_8() similarly
4. Keep scalar fallback for portability
5. Test for bit-identical output

**Estimated effort**: 8-12 hours (can be done in next session)
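
The chunk-plus-fallback structure of this plan can be sketched in stable Rust. The names `simd_sin_cos_8` and the 8-lane width come from the plan above; the signature, the `f32` element type, and `sin_cos_batch` are assumptions. The kernel body here is a scalar placeholder marking where a `std::simd` implementation would go.

```rust
/// Placeholder 8-lane kernel. A vectorized version would replace the
/// loop body with SIMD operations; this one calls scalar `sin_cos`
/// per lane, so its output matches the scalar path exactly.
pub fn simd_sin_cos_8(x: [f32; 8]) -> ([f32; 8], [f32; 8]) {
    let mut s = [0.0f32; 8];
    let mut c = [0.0f32; 8];
    for i in 0..8 {
        let (si, ci) = x[i].sin_cos();
        s[i] = si;
        c[i] = ci;
    }
    (s, c)
}

/// Process `input` in chunks of 8 through the lane kernel, then handle
/// the leftover tail (fewer than 8 values) with the scalar fallback.
pub fn sin_cos_batch(input: &[f32], sin_out: &mut [f32], cos_out: &mut [f32]) {
    assert_eq!(input.len(), sin_out.len());
    assert_eq!(input.len(), cos_out.len());

    let mut chunks = input.chunks_exact(8);
    let mut offset = 0;
    for chunk in &mut chunks {
        let mut lanes = [0.0f32; 8];
        lanes.copy_from_slice(chunk);
        let (s, c) = simd_sin_cos_8(lanes);
        sin_out[offset..offset + 8].copy_from_slice(&s);
        cos_out[offset..offset + 8].copy_from_slice(&c);
        offset += 8;
    }
    // Scalar fallback for the remainder.
    for (i, &x) in chunks.remainder().iter().enumerate() {
        let (s, c) = x.sin_cos();
        sin_out[offset + i] = s;
        cos_out[offset + i] = c;
    }
}
```

Comparing results with `to_bits()` against the scalar path gives the bit-identical check from step 5. A real vectorized kernel that uses its own polynomial approximation may not pass that check, in which case a documented tolerance would be needed instead.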

---

## Lessons Learned

### 1. Empirical Profiling Beats Theory
- Initial analysis predicted the HEALPix math (the SIMD target) as the bottleneck
- Profiling revealed Cairo overhead was primary issue
- Decision: Address largest actual bottleneck first

### 2. Batching Trade-off
- Phase 1 (batching) was a good improvement (23.8%)
- But Phase 2B (architecture change) was much better (36.2%)
- Sometimes a different approach beats optimization

### 3. Existing Infrastructure Reuse
- `PngSink` already existed and was perfect
- Cairo's `ImageSurface::create_for_data()` was ideal conversion
- No need to reinvent, just use differently

### 4. Profiling-Driven Development Pays Off
- Pre-commit profiling eliminated guesswork
- Identified exact overhead and its source
- Led to architectural improvement, not just micro-optimization

---

## Release Readiness

**v0.4.0 is ready for release**:
- ✅ 51.4% performance improvement from v0.2.0
- ✅ Output visually identical (pixel-perfect)
- ✅ All tests passing
- ✅ Comprehensive documentation
- ✅ Fallback code paths intact

### Version Numbering Decision

Two options were considered:
- **v0.3.1**: treat Phase 2B (image pre-rendering) as an increment on the Phase 1 batching release (v0.3.0)
- **v0.4.0**: treat Phase 2B as a release in its own right

Decision: release as **v0.4.0**, since the improvements are cumulative and substantial (51.4% total).

---

## Documentation Created

1. **docs/PHASE2_OPTIMIZATION_STRATEGY.md**
   - Comprehensive architecture review post-Cairo-batching
   - Analysis of SIMD vs image pre-rendering trade-offs
   - Decision rationale for Phase 2B

2. **docs/PHASE2B_DECISION.md**
   - Detailed profiling results from perf record
   - Identified cairo_surface_finish as bottleneck
   - Recommended Phase 2B over Phase 2A (and was right!)

3. **docs/PHASE2A_VECTORIZATION_PLAN.md**
   - Step-by-step SIMD implementation guide
   - Platform compatibility strategy
   - Risk assessment and fallback plans

---

## Summary Metrics

| Metric | v0.2.0 | v0.4.0 | Improvement |
|--------|--------|--------|-------------|
| PDF Time | 617ms | 300ms | -317ms (-51.4%) |
| PNG Time | 173ms | 160ms | -13ms (-7.5%) |
| Cairo Fill Calls | 51,456 | 0* | -51,456 (-100%) |
| Cairo Operations | Per-pixel | Single paint() | Massive reduction |
| Code Complexity | Complex batching | Simple buffer | Much cleaner |

*Zero in rendering path; still used elsewhere (colorbar, graticule)

---

## Conclusion

**Exceptional session results**:
- ✅ Phase 1 (batching): 23.8% improvement exceeded 10-15% target
- ✅ Phase 2B (architecture): 36.2% improvement exceeded 10-15% target
- ✅ Combined: 51.4% improvement vastly exceeds expectations
- ✅ Phase 2A (SIMD): Fully planned, ready to implement

**Key insight**: Empirical profiling guided us away from the initially planned optimization (SIMD) toward a better solution (an architecture change). The combination of:
1. Systematic measurement
2. Identifying actual bottleneck
3. Choosing right optimization for that bottleneck
4. Leveraging existing infrastructure

...produced exceptional results.

**Next step**: Implement Phase 2A SIMD vectorization in a future session to reach the ~54% total improvement target.