# Profile Analyzer Tool Implementation Plan v2
## Overview
Build a Rust CLI tool (`profile-inspect`) that converts V8 .cpuprofile and .heapprofile files to human-readable formats and identifies hot paths for optimization.
## User Requirements
- **Project location**: `/Users/qing/p/github/profile-inspecting/`
- **Node internals filtering**: Filter out by default, show with `--include-internals` flag
- **Source map support**: Full support - parse .map files to resolve minified names
- **Output format**: Structured markdown with clear sections (tables, headers, bullet points)
- **Multiple outputs**: Support generating multiple formats simultaneously
## Architecture
```
profile-inspect/
├── Cargo.toml
└── src/
    ├── main.rs                  # CLI entry point (clap)
    ├── lib.rs                   # Library root
    │
    ├── types/                   # Raw V8 profile types
    │   ├── mod.rs
    │   ├── cpu.rs               # V8 CPU profile JSON types
    │   └── heap.rs              # V8 heap profile JSON types
    │
    ├── ir/                      # Normalized intermediate representation
    │   ├── mod.rs
    │   ├── frame.rs             # Frame: name, file, line, col, kind, category
    │   ├── stack.rs             # Stack: Vec<FrameId>
    │   └── sample.rs            # Sample: timestamp, stack_id, weight
    │
    ├── parser/                  # V8 profile → IR conversion
    │   ├── mod.rs
    │   ├── cpu_profile.rs       # CPU profile parser
    │   └── heap_profile.rs      # Heap profile parser
    │
    ├── classify/                # Frame classification
    │   ├── mod.rs
    │   └── frame_classifier.rs  # Categorize: App, Deps, NodeInternal, V8, Native
    │
    ├── sourcemap/               # Source map resolution
    │   ├── mod.rs
    │   └── resolver.rs          # Minified → original location mapping
    │
    ├── analysis/                # Analysis algorithms
    │   ├── mod.rs
    │   ├── cpu_analysis.rs      # Self-time, total-time, hot paths
    │   ├── heap_analysis.rs     # Allocation analysis
    │   ├── diff.rs              # Profile comparison (A vs B)
    │   └── caller_callee.rs     # Call graph attribution
    │
    └── output/                  # Output formatters
        ├── mod.rs               # OutputFormat trait
        ├── markdown.rs          # Markdown (default, LLM-optimized)
        ├── text.rs              # Plain text
        ├── json.rs              # JSON summary (for CI)
        ├── speedscope.rs        # Speedscope JSON
        └── collapsed.rs         # Collapsed stacks (flamegraph)
```
## Intermediate Representation (IR)
The IR normalizes both CPU and heap profiles into a common structure that all outputs consume:
```rust
// src/ir/frame.rs
pub struct Frame {
    pub id: FrameId,
    pub name: String,                      // Function name (resolved via source map if available)
    pub file: Option<String>,              // Source file path
    pub line: Option<u32>,
    pub col: Option<u32>,
    pub kind: FrameKind,                   // Function, Native, GC, Eval, Wasm, etc.
    pub category: FrameCategory,           // App, Deps, NodeInternal, V8Internal, Native
    pub minified_name: Option<String>,     // Minified name, kept when resolved to an original
    pub minified_location: Option<String>,
}

pub enum FrameKind { Function, Native, GC, Eval, Wasm, Builtin, RegExp, Idle, Unknown }
pub enum FrameCategory { App, Deps, NodeInternal, V8Internal, Native }

// src/ir/sample.rs
pub struct Sample {
    pub timestamp_us: u64,
    pub stack_id: StackId,
    pub weight: u64, // Time delta in µs (CPU) or bytes (heap)
}

// src/ir/stack.rs
pub struct Stack {
    pub id: StackId,
    pub frames: Vec<FrameId>, // Root-to-leaf order
}
```
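Parsers will produce many duplicate frames, so the IR implies an interning step: identical call frames should map to one `FrameId`. A minimal std-only sketch of such an interner; the `FrameKey` shape and method names here are assumptions for illustration, not the final API:

```rust
use std::collections::HashMap;

/// Key identifying a unique call frame before classification.
/// (Hypothetical sketch; the real Frame carries more fields.)
#[derive(Clone, PartialEq, Eq, Hash)]
struct FrameKey {
    name: String,
    file: Option<String>,
    line: Option<u32>,
    col: Option<u32>,
}

#[derive(Default)]
struct FrameInterner {
    by_key: HashMap<FrameKey, u32>, // FrameId represented as a plain u32 here
    frames: Vec<FrameKey>,
}

impl FrameInterner {
    /// Return the existing id for this frame, or allocate a new one.
    fn intern(&mut self, key: FrameKey) -> u32 {
        if let Some(&id) = self.by_key.get(&key) {
            return id;
        }
        let id = self.frames.len() as u32;
        self.frames.push(key.clone());
        self.by_key.insert(key, id);
        id
    }
}
```

Interning keeps `Sample` and `Stack` small (ids instead of strings) and makes the aggregation maps in the analysis phase cheap to key.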
## CLI Interface
```bash
# Basic usage (markdown output by default)
profile-inspect cpu input.cpuprofile
profile-inspect heap input.heapprofile
# Multiple output formats simultaneously
profile-inspect cpu input.cpuprofile -f markdown -f speedscope -f json -o ./output/
# Filtering and grouping
profile-inspect cpu input.cpuprofile --group-by function # (default)
profile-inspect cpu input.cpuprofile --group-by file
profile-inspect cpu input.cpuprofile --group-by module
profile-inspect cpu input.cpuprofile --filter app # Only app code
profile-inspect cpu input.cpuprofile --filter deps # Only node_modules
profile-inspect cpu input.cpuprofile --include-internals # Include node/v8 internals
# Noise control
profile-inspect cpu input.cpuprofile --min-percent 1.0 # Drop <1% functions
profile-inspect cpu input.cpuprofile --top 30 # Show top 30
# Source map resolution
profile-inspect cpu input.cpuprofile --sourcemap-dir ./dist/
# Profile comparison (diff)
profile-inspect diff before.cpuprofile after.cpuprofile
# Explain a specific function (show callers/callees)
profile-inspect explain input.cpuprofile --function "getClassOrder"
# Output files generated:
# ./output/profile-analysis.md (markdown)
# ./output/profile-analysis.json (json summary)
# ./output/profile.speedscope.json (speedscope)
# ./output/profile.collapsed.txt (collapsed stacks)
```
## Output Formats
### 1. Markdown (Default)
- GitHub-flavored markdown with tables
- Time breakdown by category
- Collapsible sections for details
- Hot paths with call trees
- Caller/callee attribution
- **File**: `profile-analysis.md`
### 2. JSON Summary
- Machine-readable for CI/automation
- Performance regression checks
- Includes all metrics and hot paths
- **File**: `profile-analysis.json`
### 3. Speedscope JSON
- Compatible with https://www.speedscope.app/
- Interactive flame chart visualization
- **File**: `profile.speedscope.json`
### 4. Collapsed Stacks
- Brendan Gregg format for flamegraph tools
- Compatible with `flamegraph.pl`, `inferno`
- **File**: `profile.collapsed.txt`
### 5. Plain Text
- ASCII tables for terminal viewing
- **File**: `profile-analysis.txt`
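All five formatters can hang off one small trait in `output/mod.rs`. A sketch with the collapsed-stacks formatter as the simplest implementation; the `Analysis` struct here is a simplified stand-in (string stacks instead of `FrameId`s), not the real analysis result:

```rust
use std::collections::HashMap;

/// Stand-in for the real analysis result: stacks as frame-name
/// vectors (root to leaf) with aggregated sample weights.
struct Analysis {
    stack_weights: HashMap<Vec<String>, u64>,
}

trait OutputFormat {
    fn file_name(&self) -> &'static str;
    fn render(&self, analysis: &Analysis) -> String;
}

/// Brendan Gregg collapsed format: one "root;child;leaf weight" line per stack.
struct Collapsed;

impl OutputFormat for Collapsed {
    fn file_name(&self) -> &'static str {
        "profile.collapsed.txt"
    }

    fn render(&self, analysis: &Analysis) -> String {
        let mut lines: Vec<String> = analysis
            .stack_weights
            .iter()
            .map(|(stack, w)| format!("{} {}", stack.join(";"), w))
            .collect();
        lines.sort(); // Deterministic output for diffing and snapshot tests
        lines.join("\n")
    }
}
```

With a `Vec<Box<dyn OutputFormat>>` registry, the `-f` flags become a simple loop: render each requested format and write it to `output_dir.join(fmt.file_name())`.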
## Key Algorithms
### CPU Analysis (Stack-Based Aggregation)
**Important**: Reconstruct stacks from samples, don't rely on DFS over node children.
```rust
use std::collections::{HashMap, HashSet};

pub fn analyze_cpu_profile(profile: &CpuProfile) -> CpuAnalysis {
    // 1. Build parent map: node_id -> parent_id
    let parent_map = build_parent_map(&profile.nodes);

    // 2. For each sample, reconstruct the full stack
    let mut self_times: HashMap<FrameId, u64> = HashMap::new();
    let mut total_times: HashMap<FrameId, u64> = HashMap::new();
    let mut stack_weights: HashMap<Vec<FrameId>, u64> = HashMap::new();

    for (i, &sample_node) in profile.samples.iter().enumerate() {
        // time_deltas can contain negative values; clamp before casting
        let weight = profile.time_deltas.get(i).copied().unwrap_or(0).max(0) as u64;

        // Reconstruct stack: leaf -> root, then reverse
        let stack = reconstruct_stack(sample_node, &parent_map);

        // Leaf gets self-time
        if let Some(&leaf) = stack.last() {
            *self_times.entry(leaf).or_default() += weight;
        }

        // Every frame on the stack gets total-time (inclusive);
        // dedupe so recursive frames aren't double-counted per sample
        let unique: HashSet<FrameId> = stack.iter().copied().collect();
        for frame in unique {
            *total_times.entry(frame).or_default() += weight;
        }

        // Aggregate by unique stack
        *stack_weights.entry(stack).or_default() += weight;
    }
    // ...
}
```
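The `reconstruct_stack` helper used above is a walk up the parent map followed by a reverse. A self-contained sketch, using plain `u32` node ids and assuming the node graph is an acyclic tree (which V8 CPU profiles guarantee):

```rust
use std::collections::HashMap;

/// Walk from a sample's leaf node up to the root via the parent map,
/// then reverse so the stack reads root -> leaf.
/// The root is the one node with no parent entry.
fn reconstruct_stack(leaf: u32, parent_map: &HashMap<u32, u32>) -> Vec<u32> {
    let mut stack = Vec::new();
    let mut current = Some(leaf);
    while let Some(node) = current {
        stack.push(node);
        current = parent_map.get(&node).copied();
    }
    stack.reverse();
    stack
}
```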
### Frame Classification
```rust
impl FrameClassifier {
    pub fn classify(&self, url: &str, name: &str) -> FrameCategory {
        // Node internals
        if url.starts_with("node:") || url.contains("internal/") {
            return FrameCategory::NodeInternal;
        }
        // V8 internals
        if matches!(name, "(garbage collector)" | "(idle)" | "(program)") {
            return FrameCategory::V8Internal;
        }
        // Native/Builtin
        if url.is_empty() || name.contains("Builtin:") || name == "(native)" {
            return FrameCategory::Native;
        }
        // Dependencies
        if url.contains("node_modules") {
            return FrameCategory::Deps;
        }
        // App code (default)
        FrameCategory::App
    }
}
```
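`heap_analysis.rs` follows the same aggregation idea over the allocation tree instead of samples. A sketch, assuming a simplified parsed `.heapprofile` node shape (field names here are assumptions about the parser output, not the raw JSON):

```rust
use std::collections::HashMap;

/// Minimal stand-in for a parsed V8 sampling heap profile node.
struct HeapNode {
    function_name: String,
    self_size: u64,          // Bytes attributed directly to this node
    children: Vec<HeapNode>,
}

/// Walk the allocation tree, accumulating self-allocated bytes per function.
fn allocation_by_function(root: &HeapNode) -> HashMap<String, u64> {
    let mut totals: HashMap<String, u64> = HashMap::new();
    let mut stack = vec![root];
    while let Some(node) = stack.pop() {
        *totals.entry(node.function_name.clone()).or_default() += node.self_size;
        stack.extend(node.children.iter());
    }
    totals
}
```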
### Profile Diff
```rust
pub struct ProfileDiff {
    pub before_total: u64,
    pub after_total: u64,
    pub regressions: Vec<FunctionDelta>,   // Got slower
    pub improvements: Vec<FunctionDelta>,  // Got faster
    pub new_hotspots: Vec<FunctionStats>,  // Didn't exist before
}

pub struct FunctionDelta {
    pub name: String,
    pub location: String,
    pub before_time: u64,
    pub after_time: u64,
    pub delta_percent: f64, // Positive = regression
}
```
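The diff itself reduces to joining per-function totals from both profiles. A sketch, under the assumption that each side has already been aggregated into a `HashMap<String, u64>` keyed by name and location:

```rust
use std::collections::HashMap;

/// Per-function change between two profiles, keyed by e.g. "name@file:line".
struct Delta {
    key: String,
    before: u64,
    after: u64,
    delta_percent: f64, // Positive = regression (slower)
}

fn diff_totals(before: &HashMap<String, u64>, after: &HashMap<String, u64>) -> Vec<Delta> {
    let mut out = Vec::new();
    for (key, &a) in after {
        let b = before.get(key).copied().unwrap_or(0);
        let pct = if b > 0 {
            (a as f64 - b as f64) / b as f64 * 100.0
        } else {
            f64::INFINITY // New hotspot: no baseline to compare against
        };
        out.push(Delta { key: key.clone(), before: b, after: a, delta_percent: pct });
    }
    // Largest regressions first
    out.sort_by(|x, y| y.delta_percent.partial_cmp(&x.delta_percent).unwrap());
    out
}
```

Functions present only in `before` would be collected in a second pass as improvements; they are omitted here for brevity.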
## Dependencies
```toml
[dependencies]
clap = { version = "4", features = ["derive"] }
serde = { version = "1", features = ["derive"] }
serde_json = "1"
thiserror = "2"
sourcemap = "9" # Source map parsing
regex = "1" # Pattern matching for filters
indexmap = { version = "2", features = ["serde"] } # Ordered maps

[dev-dependencies]
insta = "1" # Snapshot testing
```
## Files to Create
### Core Types & IR (10 files)
1. `Cargo.toml` - Project manifest
2. `src/main.rs` - CLI entry point
3. `src/lib.rs` - Library exports
4. `src/types/mod.rs` - V8 types exports
5. `src/types/cpu.rs` - Raw V8 CPU profile types
6. `src/types/heap.rs` - Raw V8 heap profile types
7. `src/ir/mod.rs` - IR exports
8. `src/ir/frame.rs` - Frame IR (FrameKind, FrameCategory)
9. `src/ir/stack.rs` - Stack IR
10. `src/ir/sample.rs` - Sample IR
### Parsing & Classification (7 files)
11. `src/parser/mod.rs` - Parser exports
12. `src/parser/cpu_profile.rs` - CPU profile → IR
13. `src/parser/heap_profile.rs` - Heap profile → IR
14. `src/classify/mod.rs` - Classifier exports
15. `src/classify/frame_classifier.rs` - Frame categorization rules
16. `src/sourcemap/mod.rs` - Source map exports
17. `src/sourcemap/resolver.rs` - Location resolution + caching
### Analysis (5 files)
18. `src/analysis/mod.rs` - Analysis exports
19. `src/analysis/cpu_analysis.rs` - CPU hot path detection
20. `src/analysis/heap_analysis.rs` - Heap allocation analysis
21. `src/analysis/diff.rs` - Profile comparison
22. `src/analysis/caller_callee.rs` - Call graph attribution
### Output (6 files)
23. `src/output/mod.rs` - OutputFormat trait + registry
24. `src/output/markdown.rs` - Markdown formatter
25. `src/output/text.rs` - Plain text formatter
26. `src/output/json.rs` - JSON summary formatter
27. `src/output/speedscope.rs` - Speedscope formatter
28. `src/output/collapsed.rs` - Collapsed stacks formatter
**Total: 28 files**
## Source Map Support
### Resolution Strategy
1. Extract URL from profile (e.g., `dist/dist--xku3Vyh.js:141153`)
2. Normalize URL schemes (`file://`, `webpack://`, `vite://`, `node:`)
3. Check for adjacent `.map` file
4. Parse inline sourcemaps (base64 data URLs)
5. Handle missing column info (line-only fallback with warning)
6. Cache resolved maps by URL + mtime
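Steps 3–4 both start by locating the `sourceMappingURL` comment near the end of the generated file. A std-only sketch of that detection (the actual map parsing would then go through the `sourcemap` crate):

```rust
/// Find the sourceMappingURL from a generated JS file's trailing lines.
/// Returns the raw URL: either a relative .map path (step 3) or a
/// base64 data URL for an inline map (step 4).
fn find_source_mapping_url(js_source: &str) -> Option<String> {
    js_source
        .lines()
        .rev()
        .take(10) // The comment is conventionally on the last line(s)
        .find_map(|line| {
            let trimmed = line.trim();
            let rest = trimmed
                .strip_prefix("//# sourceMappingURL=")
                .or_else(|| trimmed.strip_prefix("//@ sourceMappingURL="))?; // legacy prefix
            Some(rest.trim().to_string())
        })
}
```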
### Output Integration
- Display original names when resolved
- Show minified mapping in collapsible section
- Group frames by original source file
## Verification
```bash
# 1. Basic parsing
cargo run -- cpu test-runner-profile/CPU.*.cpuprofile
cargo run -- heap test-runner-profile/Heap.*.heapprofile
# 2. Multiple outputs
cargo run -- cpu test-runner-profile/CPU.*.cpuprofile -f markdown -f speedscope -f json -o ./out/
# 3. Filtering
cargo run -- cpu test-runner-profile/CPU.*.cpuprofile --filter app --min-percent 1.0
# 4. Source maps
cargo run -- cpu test-runner-profile/CPU.*.cpuprofile \
--sourcemap-dir /Users/qing/p/github/oxc_formatter/apps/oxfmt/dist/
# 5. Validate speedscope output at https://www.speedscope.app/
```
## Expected Markdown Output
```markdown
# CPU Profile Analysis
**File:** `CPU.20260124.005938.62230.0.001.cpuprofile`
## Time by Category
| Category | Time | Share |
|---|---|---|
| App code | 1,150 ms | 42.1% |
| Dependencies (node_modules) | 846 ms | 31.0% |
| Node internals | 491 ms | 18.0% |
| V8/Native | 244 ms | 8.9% |
## Top Functions by Self Time
| # | Self Time | Self % | Total Time | Function | Location |
|---|---|---|---|---|---|
| 1 | 450.23 ms | 16.5% | 1,234.56 ms | `getClassOrder` | `tailwind.ts:245:12` |
| 2 | 312.45 ms | 11.4% | 892.34 ms | `parseExpression` | `parser.ts:1892:5` |
<details>
<summary>Minified name mappings</summary>
| Resolved Name | Minified Name | Minified Location |
|---|---|---|
| `getClassOrder` | `Bae` | `dist--xku3Vyh.js:141153` |
</details>
## Hot Paths
### Path #1 (12.3% of total, 336.12 ms)
    sortTailwindClasses (prettier-plugin.ts:42)
    └─ map (native)
       └─ processClass (tailwind.ts:189)
          └─ getClassOrder (tailwind.ts:245) ← HOTSPOT
## Caller/Callee Analysis for `getClassOrder`
**Top Callers:**
| Caller | Samples | Time |
|---|---|---|
| `processClass` | 1,234 | 380 ms |
| `sortClasses` | 456 | 70 ms |
**Top Callees:**
| Callee | Time |
|---|---|
| `compileCss` | 180 ms |
| `lookupRule` | 95 ms |
## Recommendations
### Critical (>20% impact)
- **`getClassOrder`** at `tailwind.ts:245:12`
- 16.5% of total CPU time (450.23ms)
- Called from `processClass` and `sortClasses`
- **Suggestion:** Cache results by class name
### High (10-20% impact)
- **`parseExpression`** at `parser.ts:1892:5`
- 11.4% of total CPU time (312.45ms)
- **Suggestion:** Consider if re-parsing can be avoided
## Agent Summary
**Key Bottlenecks:**
1. `getClassOrder` (Tailwind CSS class ordering) - 16.5%
2. `parseExpression` (Babel parsing) - 11.4%
**Time Distribution:**
- 42% in app code - good target for optimization
- 31% in dependencies - consider upgrading or replacing slow deps
- 27% in internals/native - likely unavoidable
**Suggested Optimizations:**
- Cache `getClassOrder` results - classes repeat frequently
- Investigate if Babel parsing can be cached or avoided
```