map2fig 0.7.6

Fast, publication-quality HEALPix sky map visualization in Rust
# Detailed Analysis of FITS I/O and Downsampling

## Problem Statement
You observed that roughly 42% of execution time goes to FITS I/O and 29% to downsampling. The question "Would less extreme downsampling help?" is worth answering carefully, because the answer exposes the fundamental data-loading bottleneck.

## Current Pipeline Analysis

### File I/O Metrics
- **File size**: 3.0 GB
- **Pixel count**: 805.3M (nside=8192) 
- **Data type**: f32 (4 bytes/pixel)
- **Observed I/O time**: 1.32 seconds (36.3% of 3.637s total)

### Bandwidth Analysis
```
Theoretical peak: 9.1 GB/s (from hardware counters)
Minimum theoretical time: 3.0 GB ÷ 9.1 GB/s = 0.33s
Actual observed: 1.32s
Efficiency: 25% of theoretical max bandwidth
```
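The efficiency figure follows directly from the ratio of ideal to observed transfer time. As a quick sanity check of the arithmetic above, using the measured file size, peak bandwidth, and observed time (function name is illustrative):

```rust
// Sanity-check the bandwidth-efficiency arithmetic quoted above.
fn io_efficiency(file_gb: f64, peak_gb_per_s: f64, observed_s: f64) -> f64 {
    let ideal_s = file_gb / peak_gb_per_s; // minimum time at peak bandwidth
    ideal_s / observed_s                   // fraction of peak actually achieved
}

fn main() {
    let eff = io_efficiency(3.0, 9.1, 1.32);
    println!("efficiency = {:.0}%", eff * 100.0); // ~25%
}
```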

**Why only 25% efficiency?**
1. **Memmap overhead**: Memory-mapping has kernel page table management cost
2. **Header parsing**: FITS binary headers add sequential read delays
3. **Column extraction**: Even with direct f32 binary read, coordinate conversion adds overhead
4. **Memory latency**: 89.5M L3 cache misses cause ~1.1B cycles of stalling
5. **TLB pressure**: Large 3GB file causes TLB misses on memmap navigation

### Downsampling Analysis

| Output Width | Height | Target NSIDE | Factor | Output Pixels | Data Reduction |
|---|---|---|---|---|---|
| 600 | 300 | 256 | 32x | 786K | 99.9% |
| 800 | 400 | 512 | 16x | 3.1M | 99.6% |
| **1200** | **600** | **512** | **16x** | **3.1M** | **99.6%** |
| 1600 | 800 | 1024 | 8x | 12.6M | 98.4% |
| 2400 | 1200 | 1024 | 8x | 12.6M | 98.4% |
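The table rows can be reproduced from the HEALPix pixel-count formula npix = 12 * nside^2. A small helper (names are illustrative, not the crate's API):

```rust
// HEALPix pixel count for a given nside: npix = 12 * nside^2.
fn npix(nside: u64) -> u64 {
    12 * nside * nside
}

// Fractional data reduction when downgrading from nside_in to nside_out.
fn reduction(nside_in: u64, nside_out: u64) -> f64 {
    1.0 - npix(nside_out) as f64 / npix(nside_in) as f64
}

fn main() {
    assert_eq!(npix(8192), 805_306_368); // 805.3M input pixels
    assert_eq!(npix(256), 786_432);      // 786K output pixels at 32x
    assert_eq!(npix(512), 3_145_728);    // 3.1M output pixels at 16x
    println!("reduction 8192 -> 512: {:.1}%", reduction(8192, 512) * 100.0); // 99.6%
}
```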

## Downsampling Cost vs I/O

Current phase breakdown at 1200x600 resolution:
- **FITS I/O**: 1.32s (36.3%)
  - Read 805M pixels from 3.0 GB file
  - Extract f32 column with binary header parsing
  - Apply scale_factor to valid pixels
  
- **Downsampling**: 1.08s (29.7%)
  - xyf2ring coordinate transform (805M → 3.1M pixels)
  - Pixel averaging across downsampling factor
  - Multicore Rayon parallelization (4 cores)
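The averaging step exploits a property of NESTED ordering: the children of each lower-resolution pixel are stored contiguously, so downgrading reduces to block averages. A minimal serial sketch of that idea (the real pipeline uses xyf2ring and Rayon, and must also skip invalid/UNSEEN pixels; `downgrade_nested` is a hypothetical name):

```rust
// Downgrade a NESTED-ordered HEALPix map by averaging each block of
// children. In NESTED ordering the factor^2 children of every output
// pixel are contiguous, so this is a simple chunked mean.
fn downgrade_nested(map: &[f32], factor: usize) -> Vec<f32> {
    let block = factor * factor;
    map.chunks_exact(block)
        .map(|c| c.iter().sum::<f32>() / block as f32)
        .collect()
}

fn main() {
    // nside 2 -> nside 1: 48 pixels -> 12 pixels, blocks of 4 children.
    let map: Vec<f32> = (0..48).map(|i| i as f32).collect();
    let out = downgrade_nested(&map, 2);
    assert_eq!(out.len(), 12);
    assert_eq!(out[0], 1.5); // mean of 0, 1, 2, 3
}
```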

- **Mollweide Projection**: 0.87s (23.9%)
  - Trigonometric math (sin/cos/atan2)
  - Grid coordinate transform to image space
  
- **Rendering**: 0.36s (9.9%)
  - Cairo PDF rasterization
  - PDF bytecode generation

## The I/O Bottleneck Root Cause

### Why Can't We Load Less Data?

The fundamental issue is **FITS format constraints**:

1. **Sequential storage**: HEALPix pixels stored in NSIDE-ordered array (NESTED or RING)
   - Cannot selectively read subset of pixels without touching entire file
   - Binary table structure requires reading full column sequentially

2. **No sparse indexing at FITS level**: 
   - FITS doesn't know which pixels are "important" for downsampling
   - Compression formats (gzip, bzip2) would require decompressing entire file anyway

3. **Coordinate transform overhead**: Previous attempts to downgrade-during-read FAILED:
   - Tier 3 optimization (Feb 2025): 25% slower due to per-pixel coord conversion
   - Math cost: ~50 CPU cycles per pixel × 805M pixels = 40B cycles
   - I/O savings: Only ~6% (memory allocation reduction)

### Why Direct f32 Binary Reading Helps

**Tier 1 Optimization (achieved)**:
- Reads the f32 column directly from the binary table, with no intermediate DataValue enum conversion
- Saves: building and converting a full 805M-element enum vector
- Cost: one memcpy of ~3.2 GB (unavoidable)
- Gain: 3.4× speedup on the FITS phase (removed ~71% of its time)
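As a sketch of what "direct f32 binary read" means here: FITS binary tables store values big-endian, so extracting the column is essentially one pass of byte-order conversion over the mapped bytes. This is illustrative only; the actual reader also handles header offsets and scale_factor:

```rust
// Decode a FITS-style big-endian f32 column from raw mapped bytes.
// FITS binary tables are big-endian, so each 4-byte group needs a byte
// swap on little-endian hosts; from_be_bytes handles both cases.
fn read_f32_column_be(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(4)
        .map(|b| f32::from_be_bytes([b[0], b[1], b[2], b[3]]))
        .collect()
}

fn main() {
    let raw = [0x3F, 0x80, 0x00, 0x00, 0x40, 0x00, 0x00, 0x00]; // 1.0, 2.0
    assert_eq!(read_f32_column_be(&raw), vec![1.0, 2.0]);
}
```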

**Remaining overhead** (must be paid):
- 3.0 GB memmap read to kernel buffer (~0.33s at 9.1 GB/s)
- FITS header parsing and seek operations (~0.3s)
- TLB/cache management (~0.35s)
- L3 cache misses to main memory (89.5M misses × latency)
- **Total: 1.32s hard floor**

## Could Less Downsampling Help?

### Scenario A: Smaller Output (e.g., 600×300 vs 1200×600)
- **Reduction**: 32× downsampling factor (vs 16×)
- **Pixels kept**: 786K (vs 3.1M)
- **I/O unchanged**: Still 3.0 GB (must read all data!)
- **Downsampling time**: 5-10% faster, but the full I/O cost is still paid
- **Expected total**: ~3.4s (maybe 0.1-0.2s saved)
- **Benefit**: MINIMAL—I/O dominates

### Scenario B: Larger Output (e.g., 2400×1200)
- **Reduction**: 8× downsampling factor (vs 16×)
- **Pixels kept**: 12.6M (vs 3.1M) - 4× more data
- **I/O unchanged**: Still 3.0 GB
- **Downsampling time**: 4× longer (more pixels to process)
- **Rendering time**: Longer (larger output image)
- **Expected total**: ~4.0-4.2s
- **Benefit**: NEGATIVE—worse performance!

### Why? The I/O Dominates

```
Total time ≈ I/O (1.32s) + Downsampling (depends on output size)

For 1200px:  1.32s + 1.08s = 2.40s data phase → 3.64s total
For 600px:   1.32s + 0.95s = 2.27s data phase → 3.50s total (saving 140ms)
For 2400px:  1.32s + 1.45s = 2.77s data phase → 3.95s total (cost 310ms)
```
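The same additive model can be written down explicitly. The constant tail (projection + rendering, ~1.23s) is an assumption that those phases are roughly independent of output size at these resolutions, which the 1200px breakdown above supports:

```rust
// Additive cost model: fixed I/O + output-size-dependent downsampling
// + a roughly constant projection/rendering tail.
fn total_time(io_s: f64, downsample_s: f64, tail_s: f64) -> f64 {
    io_s + downsample_s + tail_s
}

fn main() {
    let tail = 0.87 + 0.36; // projection + rendering from the phase breakdown
    println!("1200px: {:.2}s", total_time(1.32, 1.08, tail)); // ~3.63s
    println!(" 600px: {:.2}s", total_time(1.32, 0.95, tail)); // ~3.50s
}
```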

**The I/O cost is fixed**. Downsampling factor barely matters because:
- Tier 2 optimization (prefetch hints): Only 3.2% gain
- Tight inner loop limited by memory bandwidth, not CPU


## Optimization Opportunities

### ✅ Already Exploited
1. **Direct f32 binary read** (Tier 1): 3.4× speedup
2. **Memmap I/O**: 20-21% speedup
3. **Percentile streaming**: 79% memory reduction, 49% faster scaling
4. **Prefetch hints**: 3.2% speedup (L3-aware optimization)
5. **Generic is_seen()**: 4.7% speedup (eliminated f32→f64)

### 🔴 Dead Ends (Proven Ineffective)
1. **Downgrade-during-read**: 25% SLOWER—coordinate conversion overhead exceeds I/O savings
2. **Spatial tiling**: 12% REGRESSION—interferes with NESTED indexing, Rayon overhead
3. **Precision reduction (f64→f32)**: 2-3% SLOWER—conversion overhead, math already optimized

### 🟡 Remaining Opportunities (But Difficult)

**Hard Limits:**
- I/O bandwidth: 9.1 GB/s (hardware constraint)
- Memory latency: 89.5M L3 misses = 1.1B cycles (unavoidable)
- Theoretical minimum: 0.33s for 3.0 GB file (currently 4× slower)

**GPU Acceleration (5-10× potential)**:
- Target: Downsampling xyf2ring transform (1.08s)
- Approach: CUDA/HIP kernel for parallel downgrade_healpix_map
- Difficulty: NEW (SDK dependency, porting complexity)
- ROI: Best option if available

**Improved I/O Pipeline (20-30% potential)**:
- Use async I/O to start downsampling while still reading
- Requires: Overlapping kernel I/O + page fault handling
- Difficulty: VERY HIGH (OS kernel API, memory safety concerns)
- ROI: Complex for marginal gain
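One shape such a pipeline could take, sketched with a plain reader thread and a channel rather than real async I/O: the producer hands off chunks while the consumer reduces them, instead of read-everything-then-process. The chunking and the per-chunk partial reduction are assumptions; the actual gain depends on how much downsampling work each chunk enables.

```rust
use std::sync::mpsc;
use std::thread;

// Overlap "reading" with processing: a producer thread sends chunks over a
// channel while the consumer computes a partial reduction per chunk.
fn overlapped_sum(n_chunks: usize, chunk_len: usize) -> f64 {
    let (tx, rx) = mpsc::channel::<Vec<f32>>();

    let reader = thread::spawn(move || {
        // Stand-in for chunked FITS reads from a memory map.
        for c in 0..n_chunks {
            let chunk: Vec<f32> = (0..chunk_len).map(|i| (c * chunk_len + i) as f32).collect();
            tx.send(chunk).expect("consumer hung up");
        }
        // tx is dropped here, which terminates the consumer's iteration.
    });

    // Consumer: reduce each chunk while the next one is being produced.
    let total: f64 = rx.iter().map(|c| c.iter().map(|&v| v as f64).sum::<f64>()).sum();
    reader.join().unwrap();
    total
}

fn main() {
    println!("sum = {}", overlapped_sum(4, 1024));
}
```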

**Compress FITS Before Distribution**:
- Users manage file sizes more easily
- The tool reads less raw data (a compressed file can be under 1 GB instead of 3 GB)
- Downside: requires a pre-compression step, and decompression cost at read time
- ROI: best for users with many files to process

## Recommendations

### For v0.7.5 (Current State)
✅ Accept I/O bottleneck as hardware-limited  
✅ Published results show 4.7% improvement from is_seen() optimization  
✅ Throughput: ~0.27 maps/second on large files (3.6s per 3 GB map)  

### For Future (v0.8+)
1. **Primary focus**: GPU acceleration for downsampling
   - Could save 1.0-1.5 seconds (25-35% total improvement)
   - Aligns with algorithm (downsampling is embarrassingly parallel)
   
2. **Secondary focus**: Better I/O prefetching strategy
   - Study actual memory access patterns during FITS → downsampling
   - May yield 5-10% gain through cache reuse
   
3. **User-focused**: Recommend users compress FITS, e.g. with `fpack` from CFITSIO:
   ```bash
   fpack cosmoglobe.fits   # writes cosmoglobe.fits.fz (tiled compression)
   # A ~1 GB compressed file then replaces the 3 GB original on disk
   ```

## Conclusion

**The 42% I/O cost is not a performance bug—it's a physics problem.**

The fundamental issue is that FITS files are optimized for archival storage and scientific completeness, not for fast visualization on specific hardware. To process a 3 GB file with 9.1 GB/s memory bandwidth limited by L3 cache misses and TLB pressure, 1.32 seconds is near-optimal.

Downsampling from 16× reduction to 32× saves only ~100-150ms because the I/O dominates. The real opportunities lie in:
1. **GPU acceleration** (biggest potential, but complex)
2. **User-side compression** (easiest, already effective)
3. **Accept current performance** (3.6s is competitive for this workload)

The platform-specific limitations (L3 bandwidth, TLB size, memmap overhead) are constraints from the hardware itself, not the algorithm.