map2fig 0.7.2

Fast, publication-quality HEALPix sky map visualization in Rust
# map2fig v0.7.0 Release Notes

**Release Date:** February 18, 2026

## Summary

This release focuses on code quality, validation of performance optimizations, and repository organization. It achieves a **3.2% performance improvement** through optimized prefetch hints and documents all of the optimization work in detail.

## Major Changes

### ✅ Code Quality Improvements
- Fixed all Clippy warnings (7 instances across multiple files)
  - Converted explicit loop counters to `.enumerate()`
  - Removed unused imports and variables
  - Simplified format!() calls
  - Reorganized code structure for proper module ordering
- **All 180 unit tests passing** (0 failures, 2 ignored)
- `cargo fmt` formatting verified
- Clean build: `cargo clippy --all-targets --all-features -- -D warnings`
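
As an illustration of the loop-counter cleanup listed above, the sketch below shows the general pattern Clippy reports as `explicit_counter_loop` and its `.enumerate()` equivalent. The function names and data are hypothetical and are not taken from the map2fig source.

```rust
/// Before: a manually maintained counter, which Clippy flags as
/// `explicit_counter_loop`.
#[allow(clippy::explicit_counter_loop)]
fn label_rows_manual(rows: &[&str]) -> Vec<String> {
    let mut out = Vec::new();
    let mut i = 0;
    for row in rows {
        out.push(format!("{}: {}", i, row));
        i += 1;
    }
    out
}

/// After: `.enumerate()` yields the index alongside each item, and the
/// format arguments are captured inline.
fn label_rows(rows: &[&str]) -> Vec<String> {
    rows.iter()
        .enumerate()
        .map(|(i, row)| format!("{i}: {row}"))
        .collect()
}
```

Both functions return the same labels; the second has no mutable counter to fall out of sync with the iteration.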

### 🔬 Performance Optimization Completion (Tier 5 - Prefetch)
- **+3.2% wall-clock improvement** on nside=8192 maps (7.502s → 7.263s)
- Implemented x86_64 prefetch hints in downsampling inner loop
- `perf` profiling attributes 7.68% of call-graph samples to the prefetch instructions, a cost absorbed by otherwise idle CPU time
- Benchmark: 5 runs, standard deviation ±0.192s (tighter than the baseline)
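
For readers unfamiliar with software prefetching, here is a minimal sketch of the technique, assuming a gather-style downsampling loop that averages source pixels selected through an index table. The function names, the `PREFETCH_AHEAD` distance, and the loop shape are assumptions for illustration; the actual map2fig inner loop may look different. Only `_mm_prefetch` and `_MM_HINT_T0` are real `std::arch` items.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};

/// How many output pixels ahead to prefetch (hypothetical; tune by benchmarking).
const PREFETCH_AHEAD: usize = 16;

/// Hint that `src[idx]` will be read soon. A prefetch only affects cache
/// state; it never changes the computed result.
#[cfg(target_arch = "x86_64")]
#[inline(always)]
fn prefetch_read(src: &[f32], idx: usize) {
    if idx < src.len() {
        // SAFETY: the pointer is in bounds, and SSE is baseline on x86_64.
        unsafe { _mm_prefetch::<_MM_HINT_T0>(src.as_ptr().add(idx) as *const i8) };
    }
}

/// No-op fallback so the sketch compiles on other architectures.
#[cfg(not(target_arch = "x86_64"))]
#[inline(always)]
fn prefetch_read(_src: &[f32], _idx: usize) {}

/// Average each group of `factor` source pixels selected by `indices`.
fn downsample_gather(src: &[f32], indices: &[usize], factor: usize) -> Vec<f32> {
    assert!(factor > 0 && indices.len() % factor == 0);
    let mut out = Vec::with_capacity(indices.len() / factor);
    for (k, group) in indices.chunks_exact(factor).enumerate() {
        // Request the first source pixel of a later group while this one is summed.
        if let Some(&future) = indices.get((k + PREFETCH_AHEAD) * factor) {
            prefetch_read(src, future);
        }
        let sum: f32 = group.iter().map(|&i| src[i]).sum();
        out.push(sum / factor as f32);
    }
    out
}
```

Because the hint does not alter results, the prefetching loop can be validated against a plain loop for correctness and then judged purely on benchmark timings.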

### 📚 Repository Organization
- Created comprehensive documentation structure
  - `docs/optimization/` - All optimization analyses
  - `docs/performance/` - Performance profiling results
  - `docs/dev/` - Development notes
- Moved test artifacts to the `target/` directory
- Updated INDEX.md with new documentation links
- `.github/copilot-instructions.md` includes latest optimization results

### 📝 Detailed Optimization Documentation
- **PREFETCH_OPTIMIZATION_RESULTS.md** - Validates +3.2% improvement
- **TILING_OPTIMIZATION_FAILURE_ANALYSIS.md** - Documents why spatial tiling caused -12% regression
- **DOWNSAMPLING_OPTIMIZATION_SESSION_FEB2026.md** - Full session replay with lessons learned
- **OPTIMIZATION_AUDIT_2026.md** - Updated with Feb 17-18 work

---

# map2fig v0.6.0 Release Notes

**Release Date:** February 17, 2026

## Overview

This release focuses on performance optimization, benchmarking infrastructure, and bottleneck analysis. It achieves a **16.8% performance improvement** on large files (3.1GB: 14.1s → 11.7s) and establishes a comprehensive benchmarking suite to prevent performance regressions and guide future optimization.

## Major Changes & Improvements

### 🚀 Performance Optimization (16.8% improvement on large files)

**File Size Improvements**:
```
File Size | Version 0.5.0 | Version 0.6.0 | Improvement
──────────────────────────────────────────────────────
6MB       | 369.4ms       | 316.9ms       | 14.1% ✅
24MB      | 513.9ms       | 500.2ms       | 2.7%
72MB      | 523.0ms       | 498.6ms       | 4.7%
192MB     | 800.1ms       | 841.4ms       | -5.2% (slower)
576MB     | 845.0ms       | 806.0ms       | 4.7%
3.1GB     | 14118ms       | 11709ms       | 16.8% ✅
```
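
The improvement column is a relative change against the v0.5.0 time. A minimal sketch of that arithmetic follows; the function name is hypothetical. Recomputing from the rounded timings in the table can differ from the quoted percentages by a few tenths of a percent, and a negative result means the newer version was slower.

```rust
/// Relative improvement as a percentage of the baseline time.
/// Positive means faster than the baseline; negative means slower.
fn improvement_percent(before_ms: f64, after_ms: f64) -> f64 {
    (before_ms - after_ms) / before_ms * 100.0
}
```

For example, `improvement_percent(513.9, 500.2)` evaluates to roughly 2.7, matching the 24MB row.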

**Optimizations Applied**:
1. **Coordinate Lookup Caching (LRU)** - 14.1% on small files, 1.1% on large files
   - Cache: 10K entries per function (pix2ang_ring, pix2ang_nest, ang2pix_ring, ang2pix_nest)
   - Memory overhead: ~320KB total
   - Thread-safe with parking_lot RwLock

2. **Column Cache Bug Fix** - Prevents pathological memory use when handling 2-3GB+ files
   - Issue: Files >1GB were cached and never deleted, causing memory exhaustion
   - Fix: Skip caching for files >1GB, avoiding 45GB+ memory usage on huge maps
   - Impact: Improved stability on large files

3. **SIMD Operations** - Ready-to-use SIMD traits via `wide` crate
   - Provides 8× vectorized operations (sin, cos, atan2, sqrt, etc.)
   - Fallback on stable Rust via delegation to wide crate
   - Available for future scaling optimizations
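
To make the coordinate lookup caching idea concrete, here is a std-only sketch of a bounded cache with least-recently-used eviction. It is a simplified stand-in, not the map2fig implementation: the type name, the `(f64, f64)` value, and the `VecDeque` bookkeeping are assumptions, and the release itself uses a `parking_lot` RwLock for thread safety, which is omitted here.

```rust
use std::collections::{HashMap, VecDeque};

/// Bounded pixel-to-angle cache with least-recently-used eviction.
struct LruCache {
    capacity: usize,
    map: HashMap<u64, (f64, f64)>,
    order: VecDeque<u64>, // front = least recently used
}

impl LruCache {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0);
        Self { capacity, map: HashMap::new(), order: VecDeque::new() }
    }

    /// Return the cached value for `key`, computing and storing it on a miss.
    fn get_or_insert_with(
        &mut self,
        key: u64,
        compute: impl FnOnce() -> (f64, f64),
    ) -> (f64, f64) {
        if let Some(&value) = self.map.get(&key) {
            // Hit: mark the key as most recently used (O(n) scan, fine for a sketch).
            if let Some(pos) = self.order.iter().position(|&k| k == key) {
                self.order.remove(pos);
            }
            self.order.push_back(key);
            return value;
        }
        // Miss: compute, evicting the least recently used entry if full.
        let value = compute();
        if self.map.len() >= self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.map.remove(&oldest);
            }
        }
        self.map.insert(key, value);
        self.order.push_back(key);
        value
    }
}
```

Wrapping such a cache in a lock (for example `std::sync::Mutex<LruCache>`) would make it shareable across threads; a production cache would also replace the linear scan with a constant-time recency list.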

### 📊 Comprehensive Benchmarking Infrastructure ✅

**New Benchmarking System**:
```
Tool         | Purpose                  | Location
──────────────────────────────────────────────────────
Hyperfine    | End-to-end (6 files)    | benches/hyperfine_benchmarks.sh
Criterion    | Micro-benchmarks        | benches/criterion_benchmarks.rs
Divan        | Cycle-accurate          | benches/divan_benchmarks.rs
Python       | Detailed pipeline timing | benches/detailed_profile.py
Script       | Unified runner           | benches/run_benchmarks.sh
CI/CD        | Regression detection     | .github/workflows/benchmarks.yml
```

**Benchmark Results (Large File - 3.1GB, nside=8192)**:
```
Operation       | Time  | % of Total | Status
────────────────────────────────────────────────
FITS Reading    | 10.9s | 81%       | Bottleneck
Mollweide Proj  | 1.3s  | ~10%      | Secondary
Cairo Render    | ~0.3s | ~2%       | Minimal
Total Time      | 11.7s | 100%      | Optimized
```

**Key Metrics**:
- Variance on 11.7s baseline: ±0.1s (0.87% - excellent stability)
- Sample size: 5 runs per benchmark + 1 warmup
- Confidence level: 95% CI
- Benchmarks: 6-file suite covering 6MB to 3.1GB
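
The stability figures above come down to a mean and a standard deviation over repeated runs. The sketch below shows one common way to compute them, assuming the sample (n - 1) form of the standard deviation; the function name is hypothetical, and the benchmarking tools compute these statistics themselves.

```rust
/// Mean and sample standard deviation of a set of run times.
/// Returns `None` when there are fewer than two samples.
fn mean_and_std_dev(samples: &[f64]) -> Option<(f64, f64)> {
    if samples.len() < 2 {
        return None;
    }
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    // Sample variance divides by n - 1 (Bessel's correction).
    let variance = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    Some((mean, variance.sqrt()))
}
```

Dividing the standard deviation by the mean gives the relative spread quoted as a percentage.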

**Documentation**: `docs/current/BENCHMARKING_SETUP.md` - Complete setup and usage guide

### 🔍 Bottleneck Identification

**Current State**:
1. FITS Reading: **81% of load time** (10.9s of 13.4s)
   - Limited parallelization (files are read sequentially from disk)
   - A streaming reader could help by overlapping reads with processing
   - Currently limited by I/O patterns and memory bandwidth

2. Mollweide Projection: **~10% of total time**
   - Vectorization potential: 3-4× with SIMD (Tier 2)
   - GPU acceleration: 2.5-2.8× (Tier 3)
   - Requires algorithmic changes

3. Scaling Operations: **<1% of total time**
   - SIMD implemented but shows minimal improvement
   - No further optimization worthwhile

**Impact Analysis**:
- Optimizing projection (Tier 2 SIMD): Limited ROI due to FITS bottleneck
- Optimizing scaling: Negligible ROI (<0.5% total improvement)
- Further CPU optimizations without addressing FITS reading will have minimal impact
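
The limited return can be estimated with Amdahl's law: if a stage takes a fraction p of total time and is made s times faster, the overall speedup is 1 / ((1 - p) + p / s). The sketch below applies it; the function name is hypothetical, and the inputs are the approximate shares listed above.

```rust
/// Overall speedup from Amdahl's law when a stage taking `fraction` of the
/// total runtime is accelerated by a factor of `stage_speedup`.
fn overall_speedup(fraction: f64, stage_speedup: f64) -> f64 {
    1.0 / ((1.0 - fraction) + fraction / stage_speedup)
}
```

With the projection at roughly 10% of total time, a 4× faster projection gives `overall_speedup(0.10, 4.0)` of about 1.08, a total runtime reduction of about 7.5%. Even an infinitely fast projection could not reduce total time by more than its 10% share.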

### 🛠️ Code Quality Improvements

- ✅ All clippy warnings fixed (feature gate, unused imports, unit_arg)
- ✅ Stable Rust compatibility (removed nightly feature attempts)
- ✅ Benchmark code cleanup (proper black_box usage)
- ✅ Documentation updated (SIMD module, benchmarking setup)
- ✅ Test suite: 171/173 passing (2 ignored Hammer tests - expected)
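
On the `black_box` point: a benchmark can report misleadingly small times if the compiler removes work whose result is never used. The sketch below shows the idea with `std::hint::black_box` and a hand-rolled timing loop. It is an illustration only; the helper name is hypothetical, and the project's Criterion and Divan benchmarks use those crates' own harnesses.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Average wall-clock time per call of `f` over `iterations` calls.
/// Passing each result through `black_box` keeps the optimizer from
/// discarding the computation as dead code.
fn time_it<T>(iterations: u32, mut f: impl FnMut() -> T) -> Duration {
    let start = Instant::now();
    for _ in 0..iterations {
        black_box(f());
    }
    start.elapsed() / iterations.max(1)
}
```

A call such as `time_it(1_000, || black_box(2.0_f64).sqrt())` also wraps the input in `black_box`, so the square root is not folded into a constant at compile time.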

## Backward Compatibility

✅ **Fully backward compatible** with v0.5.0
- No API changes
- No data format changes
- Performance improvements are transparent

## Known Limitations

1. **Large File Performance**: 3.1GB files process in ~11.7s
   - Bottleneck: Sequential disk I/O, not removable without architectural change
   - Workaround: Use smaller FITS subsets if needed

2. **Hammer Projection**: 2 roundtrip tests skipped
   - Known limitation: Inverse transform not implemented
   - Will be addressed in v0.7.0

## Testing & Validation

- ✅ All unit tests passing (171/173, 2 intentionally ignored)
- ✅ Integration tests verified
- ✅ Performance regression suite configured
- ✅ 6-file benchmark suite established
- ✅ CI/CD pipeline with benchmarks enabled

## Performance Roadmap

### Tier 2: SIMD Vectorization (3-4× potential)
- Expected: 15-25% real-world improvement on large files
- Effort: 4-6 hours
- Status: Blocked by FITS reading bottleneck

### Tier 3: GPU Acceleration (2.5-2.8× potential)
- Expected: 30-50% improvement with proper pipelining
- Effort: 40+ hours design and implementation
- Status: Future consideration

### Streaming FITS Reader
- Expected: 20-30% improvement if overlapped with rendering
- Effort: 8-12 hours
- Status: Recommended next step to unlock further optimizations
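
To illustrate what overlapping I/O with later pipeline stages could look like, here is a std-only sketch that reads chunks on a background thread and processes them on the calling thread through a bounded channel. This is a design sketch under stated assumptions, not the planned implementation: the function name, chunk size, and channel depth are hypothetical, and a real FITS reader would need to respect the format's block and column layout.

```rust
use std::io::{self, Read};
use std::sync::mpsc;
use std::thread;

/// Read `source` in chunks on a background thread and pass each chunk to
/// `consume` on the calling thread, so reading and processing overlap.
/// Returns the total number of bytes consumed.
fn stream_chunks<R, F>(mut source: R, chunk_size: usize, mut consume: F) -> io::Result<u64>
where
    R: Read + Send + 'static,
    F: FnMut(&[u8]),
{
    // Bounded channel: the reader stays at most 4 chunks ahead (backpressure).
    let (tx, rx) = mpsc::sync_channel::<io::Result<Vec<u8>>>(4);

    let reader = thread::spawn(move || loop {
        let mut buf = vec![0u8; chunk_size];
        match source.read(&mut buf) {
            Ok(0) => break, // end of input
            Ok(n) => {
                buf.truncate(n);
                if tx.send(Ok(buf)).is_err() {
                    break; // consumer stopped listening
                }
            }
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => {
                let _ = tx.send(Err(e));
                break;
            }
        }
    });

    let mut total = 0u64;
    for chunk in rx {
        let chunk = chunk?;
        total += chunk.len() as u64;
        consume(&chunk);
    }
    reader.join().expect("reader thread panicked");
    Ok(total)
}
```

The bounded channel caps memory use at a few chunks, which matters for multi-gigabyte inputs. Whether this yields the estimated gain depends on how much processing can actually proceed before the full map is loaded.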

## Commits & Transactions

- **3 commits** in the February 17, 2026 session
- 113+ files reorganized into docs/ structure
- Benchmarking infrastructure fully integrated
- All changes tested and verified

## Upgrading from v0.5.0

No code changes are required. Simply update the version in `Cargo.toml`:
```toml
[dependencies]
map2fig = "0.6.0"
```

All features and APIs are identical to v0.5.0. Performance improvements are automatic.

## Acknowledgments

Performance optimization guided by:
- Statistical benchmarking with Hyperfine/Criterion
- CPU profiling (perf) to identify bottlenecks
- Systematic hypothesis testing and validation
- Comprehensive documentation of findings

## Next Steps

1. **Monitor performance**: Use benchmarking suite to track regressions
2. **Investigate streaming**: Evaluate benefit of overlapping I/O and rendering
3. **Plan GPU work**: Design CUDA kernel for Mollweide projection
4. **Community feedback**: Happy to discuss optimization strategies

---

**Previous Release**: [v0.5.0 Release Notes](https://github.com/dncnwtts/map2fig/releases/tag/v0.5.0)

**Full Documentation**: See [docs/](docs/) for detailed guides and benchmarking results

---

- `clap` 4.5.x - Command-line argument parsing

## Breaking Changes
None. This is a fully backward-compatible release.

## Installation

```bash
cargo install map2fig@0.5.0
```

Or build from source:
```bash
git clone https://github.com/dncnwtts/map2fig.git
cd map2fig && git checkout v0.5.0
cargo build --release
```

## Usage Examples

```bash
# Basic Mollweide projection
map2fig -f cosmoglobe.fits -o map.pdf

# With custom scaling and colormap
map2fig -f data.fits --log --min 1e-6 --max 1e-3 -c plasma --gamma 0.8

# Hammer projection with symlog scaling
map2fig -f data.fits --hammer --symlog

# Gnomonic projection with native resolution
map2fig -f data.fits --gnomonic --native --width 2000
```

## Documentation
See [docs/optimization/](docs/optimization/) for detailed performance analysis including:
- Tier 1-3 implementation details
- Benchmarking methodology and results
- Performance ceiling analysis
- Memory profiling data

## Contributors
- Duncan Watts (@dncnwtts) - Core development and optimization

## License
MIT

---

**Next Steps:** The next major optimization target is tile-based parallelization for multi-core speedup on 4+ core systems. GPU acceleration via CUDA is also planned for maximum speedup potential.