# Phase 14: Corpus Analysis (`pxl analyze`)
**Goal:** Add a command to analyze pixelsrc files and extract metrics that inform future primitive development and format optimization
**Status:** Complete
**Depends on:** Phase 0 (Core CLI exists)
---
## Scope
Phase 14 adds:
- `pxl analyze` command for corpus analysis
- Token frequency and co-occurrence tracking
- Structural pattern detection (outline, symmetry, gradient)
- Dimensional analysis (sprite sizes, aspect ratios)
- Compression opportunity estimation (RLE potential)
- JSON and text output formats
**Not in scope:** Visualization, watch mode, primitive suggestion, comparison diffs
---
## Motivation
Before designing primitives (common patterns, shapes, reusable components), we need data about actual usage patterns. This phase adds tooling to analyze a corpus of pixelsrc files and surface insights about:
- Which tokens are most common
- What structural patterns emerge (outlines, symmetry, gradients)
- Common dimensions and aspect ratios
- Palette reuse patterns
- Compression opportunities (run-length encoding, row repetition)
This data-driven approach ensures primitives emerge from real usage rather than speculation.
---
## Command Interface
```bash
pxl analyze <files...> # Analyze specific files
pxl analyze --dir <path> # Analyze all .jsonl in directory
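pxl analyze --recursive <path> # Include subdirectories
pxl analyze <files...> --format json # JSON output (default: text)
pxl analyze <files...> --output <file> # Write report to a file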
```
---
## Task Dependency Diagram
```
PHASE 14 TASK FLOW
═══════════════════════════════════════════════════════════════════
PREREQUISITE
┌─────────────────────────────────────────────────────────────────┐
│ Phase 0 Complete │
└─────────────────────────────────────────────────────────────────┘
│
▼
WAVE 1 (Foundation)
┌─────────────────────────────────────────────────────────────────┐
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ 14.1 Core Analysis Infrastructure │ │
│ │ - AnalysisReport struct │ │
│ │ - File collection (single, dir, recursive) │ │
│ │ - Basic CLI integration │ │
│ └──────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
▼
WAVE 2 (Parallel - Analysis Modules)
┌─────────────────────────────────────────────────────────────────┐
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ 14.2 │ │ 14.3 │ │ 14.4 │ │
│ │ Token │ │ Structural │ │ Dimensional │ │
│ │ Analysis │ │ Analysis │ │ Analysis │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│ │ │
└────────────────┬┴─────────────────┘
│
▼
WAVE 3 (After Wave 2)
┌─────────────────────────────────────────────────────────────────┐
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ 14.5 Compression Estimation │ │
│ │ - Run-length encoding opportunity calculation │ │
│ │ - Row repetition detection │ │
│ └──────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
▼
WAVE 4 (Output & Polish)
┌─────────────────────────────────────────────────────────────────┐
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ 14.6 Output Formatting & CLI Polish │ │
│ │ - JSON output mode │ │
│ │ - Text report formatting │ │
│ │ - File output option │ │
│ └──────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
═══════════════════════════════════════════════════════════════════
PARALLELIZATION SUMMARY:
┌─────────────────────────────────────────────────────────────────┐
│ Wave 1: 14.1 (1 task) │
│ Wave 2: 14.2 + 14.3 + 14.4 (3 tasks in parallel) │
│ Wave 3: 14.5 (1 task, needs token data) │
│ Wave 4: 14.6 (1 task, needs all above) │
└─────────────────────────────────────────────────────────────────┘
```
---
## Tasks
### Task 14.1: Core Analysis Infrastructure
**Wave:** 1
Create the foundation for analysis functionality.
**Deliverables:**
- New file `src/analyze.rs`:
```rust
use std::path::PathBuf;
use std::collections::HashMap;
#[derive(Debug, Default)]
pub struct AnalysisReport {
pub files_analyzed: usize,
pub total_sprites: usize,
pub total_palettes: usize,
pub total_compositions: usize,
pub token_frequency: HashMap<String, usize>,
pub token_cooccurrence: HashMap<(String, String), usize>,
pub dimensions: HashMap<(u32, u32), usize>,
pub structural_patterns: StructuralPatterns,
pub compression_stats: CompressionStats,
}
#[derive(Debug, Default)]
pub struct StructuralPatterns {
pub has_outline: usize,
pub horizontal_symmetry: usize,
pub vertical_symmetry: usize,
pub uses_gradients: usize,
pub transparency_border: usize,
}
#[derive(Debug, Default)]
pub struct CompressionStats {
pub avg_tokens_per_row: f64,
pub avg_unique_per_row: f64,
pub avg_runs_per_row: f64,
pub estimated_compression_ratio: f64,
}
pub fn collect_files(paths: &[PathBuf], recursive: bool) -> Vec<PathBuf>
pub fn analyze_files(files: &[PathBuf]) -> AnalysisReport
```
- Update `src/cli.rs`:
```rust
#[derive(Subcommand)]
enum Commands {
Analyze(AnalyzeArgs),
}
#[derive(Args)]
struct AnalyzeArgs {
#[arg(required_unless_present = "dir")]
files: Vec<PathBuf>,
#[arg(long)]
dir: Option<PathBuf>,
#[arg(long, short)]
recursive: bool,
#[arg(long, default_value = "text")]
format: OutputFormat,
#[arg(long, short)]
output: Option<PathBuf>,
}
#[derive(Clone, ValueEnum)]
enum OutputFormat {
Text,
Json,
}
```
- Update `src/lib.rs` to export `analyze` module
**Verification:**
```bash
cargo build --release
./target/release/pxl analyze --help
# Should show: pxl analyze [OPTIONS] [FILES]... with --dir, --recursive, --format, --output
./target/release/pxl analyze examples/
# Should run without error (may show empty report)
```
**Test Fixture:** `tests/fixtures/valid/analyze_corpus/` directory with 3+ sample sprites
**Dependencies:** Phase 0 complete
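A minimal sketch of how the `collect_files` deliverable above might behave, assuming that directories are filtered to `.jsonl` files while explicitly named files are accepted as-is. The `visit_dir` helper and the sort/dedup step are illustrative assumptions, not part of the spec.

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Collect input files from a mix of file and directory paths.
/// Directories are scanned one level deep unless `recursive` is set.
pub fn collect_files(paths: &[PathBuf], recursive: bool) -> Vec<PathBuf> {
    let mut found = Vec::new();
    for path in paths {
        if path.is_dir() {
            visit_dir(path, recursive, &mut found);
        } else if path.is_file() {
            // Explicitly named files are taken as-is, whatever their extension.
            found.push(path.clone());
        }
    }
    // Stable ordering keeps reports reproducible across runs.
    found.sort();
    found.dedup();
    found
}

/// Hypothetical helper: push every `.jsonl` file under `dir` into `found`.
fn visit_dir(dir: &Path, recursive: bool, found: &mut Vec<PathBuf>) {
    // Unreadable directories are skipped silently in this sketch.
    let Ok(entries) = fs::read_dir(dir) else { return };
    for entry in entries.flatten() {
        let path = entry.path();
        if path.is_dir() {
            if recursive {
                visit_dir(&path, recursive, found);
            }
        } else if path.extension().map_or(false, |ext| ext == "jsonl") {
            found.push(path);
        }
    }
}
```

Accepting directories in the positional list is what would let `pxl analyze examples/` work without `--dir`; whether the real command does so is an open design choice.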
---
### Task 14.2: Token Analysis
**Wave:** 2 (parallel with 14.3, 14.4)
Implement token frequency counting and co-occurrence tracking.
**Deliverables:**
- Add to `src/analyze.rs`:
```rust
pub struct TokenAnalyzer {
frequency: HashMap<String, usize>,
cooccurrence: HashMap<(String, String), usize>,
sprites_with_token: HashMap<String, usize>,
}
impl TokenAnalyzer {
pub fn new() -> Self
pub fn analyze_sprite(&mut self, sprite: &Sprite)
pub fn top_tokens(&self, n: usize) -> Vec<(&str, usize)>
pub fn top_cooccurrences(&self, n: usize) -> Vec<((&str, &str), usize)>
pub fn token_families(&self) -> Vec<Vec<&str>>
}
```
- Track:
- Token frequency across all sprites
- Token co-occurrence (pairs appearing in same palette)
- Token families (similar names detected via prefix matching)
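The frequency and co-occurrence tracking could be sketched as follows. This is an illustration under stated assumptions: `tokenize_row`, `TokenCounts`, and `add_grid` are hypothetical names, rows are plain strings of `{...}` tokens, and pairs are counted once per sprite grid (matching the "N sprites" unit in the report). Counting per palette, as the list above describes, would iterate the palette's keys instead.

```rust
use std::collections::{BTreeSet, HashMap};

/// Split a grid row such as "{_}{skin}{skin}{_}" into its `{...}` tokens.
/// Text outside braces is ignored in this sketch.
pub fn tokenize_row(row: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut start = None;
    for (i, ch) in row.char_indices() {
        match ch {
            '{' => start = Some(i),
            '}' => {
                if let Some(s) = start.take() {
                    tokens.push(row[s..=i].to_string());
                }
            }
            _ => {}
        }
    }
    tokens
}

#[derive(Default)]
pub struct TokenCounts {
    pub frequency: HashMap<String, usize>,
    pub cooccurrence: HashMap<(String, String), usize>,
}

impl TokenCounts {
    /// Add one sprite's grid (a slice of row strings) to the running totals.
    pub fn add_grid(&mut self, grid: &[&str]) {
        // BTreeSet keeps tokens sorted, so each pair has one canonical key (a < b).
        let mut seen = BTreeSet::new();
        for row in grid {
            for token in tokenize_row(row) {
                *self.frequency.entry(token.clone()).or_insert(0) += 1;
                seen.insert(token);
            }
        }
        let seen: Vec<String> = seen.into_iter().collect();
        for (i, a) in seen.iter().enumerate() {
            for b in &seen[i + 1..] {
                *self
                    .cooccurrence
                    .entry((a.clone(), b.clone()))
                    .or_insert(0) += 1;
            }
        }
    }
}
```

Storing each pair in sorted order avoids double-counting `(a, b)` and `(b, a)` as separate keys.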
**Verification:**
```bash
cargo test analyze::token
# Test: Empty sprite → empty results
# Test: Single sprite → correct token counts
# Test: Multiple sprites → aggregated counts
# Test: Co-occurrence matrix builds correctly
```
**Test Fixture:** `tests/fixtures/valid/analyze_tokens.jsonl`
```jsonl
{"type": "palette", "name": "test", "colors": {"{_}": "#00000000", "{skin}": "#FFCC99", "{skin_shadow}": "#CC9966", "{hair}": "#8B4513"}}
{"type": "sprite", "name": "char1", "size": [4, 4], "palette": "test", "grid": ["{_}{skin}{skin}{_}", "{skin}{skin}{skin}{skin}", "{skin_shadow}{skin}{skin}{skin_shadow}", "{_}{skin_shadow}{skin_shadow}{_}"]}
{"type": "sprite", "name": "char2", "size": [4, 4], "palette": "test", "grid": ["{_}{hair}{hair}{_}", "{hair}{skin}{skin}{hair}", "{_}{skin}{skin}{_}", "{_}{_}{_}{_}"]}
```
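For the fixture above, prefix matching would be expected to group `{skin}` with `{skin_shadow}`. One possible rule, sketched here as an assumption rather than the specified algorithm, treats the text before the first underscore as the family stem and keeps only stems shared by two or more tokens. The free function shape is hypothetical; the spec places `token_families` on `TokenAnalyzer`.

```rust
use std::collections::BTreeMap;

/// Group token names that share a prefix before the first underscore,
/// e.g. "{skin}" and "{skin_shadow}" both fall under the stem "skin".
pub fn token_families(tokens: &[&str]) -> Vec<Vec<String>> {
    let mut by_stem: BTreeMap<&str, Vec<String>> = BTreeMap::new();
    for token in tokens {
        let name = token.trim_start_matches('{').trim_end_matches('}');
        let stem = name.split('_').next().unwrap_or("");
        if stem.is_empty() {
            continue; // "{_}" has no stem to group by
        }
        by_stem.entry(stem).or_default().push(token.to_string());
    }
    // A family needs at least two members to be interesting.
    by_stem
        .into_values()
        .filter(|family| family.len() > 1)
        .collect()
}
```

A stricter rule (for example, requiring a minimum stem length) may be needed on a real corpus to avoid grouping unrelated short names.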
**Dependencies:** Task 14.1
---
### Task 14.3: Structural Analysis
**Wave:** 2 (parallel with 14.2, 14.4)
Implement pattern detection for common sprite structures.
**Deliverables:**
- Add to `src/analyze.rs`:
```rust
pub struct StructuralAnalyzer;
impl StructuralAnalyzer {
pub fn has_outline(sprite: &Sprite) -> bool
pub fn is_horizontally_symmetric(sprite: &Sprite) -> bool
pub fn is_vertically_symmetric(sprite: &Sprite) -> bool
pub fn has_gradient(sprite: &Sprite) -> bool
pub fn has_transparency_border(sprite: &Sprite) -> bool
pub fn analyze(sprite: &Sprite) -> StructuralPatterns
}
```
- Pattern detection rules:
- **Outline**: First/last row and first/last column contain `{outline}` or similar token
- **H-Symmetry**: Row[i] == reverse(Row[i]) for all rows
- **V-Symmetry**: Row[i] == Row[height-1-i] for first half of rows
- **Gradient**: 3+ consecutive different tokens in a row/column
- **Transparency border**: Outer ring is `{_}` or transparent token
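The symmetry and gradient rules above can be sketched on an already-tokenized grid. This assumes each row is a `Vec<&str>` of tokens, and it reads "3+ consecutive different tokens" as "some window of three neighbouring tokens is pairwise distinct", checked along rows only. Both the grid representation and that reading are assumptions.

```rust
/// H-symmetry: every row reads the same left-to-right and right-to-left.
pub fn is_horizontally_symmetric(grid: &[Vec<&str>]) -> bool {
    grid.iter().all(|row| row.iter().eq(row.iter().rev()))
}

/// V-symmetry: row i matches row height-1-i for the first half of the rows.
pub fn is_vertically_symmetric(grid: &[Vec<&str>]) -> bool {
    let height = grid.len();
    (0..height / 2).all(|i| grid[i] == grid[height - 1 - i])
}

/// Gradient (row direction only): some window of three neighbouring
/// tokens in a row is pairwise distinct.
pub fn has_row_gradient(grid: &[Vec<&str>]) -> bool {
    grid.iter().any(|row| {
        row.windows(3)
            .any(|w| w[0] != w[1] && w[1] != w[2] && w[0] != w[2])
    })
}
```

Note that under this reading a mirrored row such as `{a}{b}{c}{b}{a}` counts as both symmetric and a gradient, so the two flags are not mutually exclusive. An empty grid is trivially symmetric in this sketch.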
**Verification:**
```bash
cargo test analyze::structural
# Test: Symmetric sprite detected as symmetric
# Test: Asymmetric sprite detected as not symmetric
# Test: Outline sprite detected
# Test: Gradient pattern detected
```
**Test Fixture:** `tests/fixtures/valid/analyze_patterns.jsonl`
```jsonl
{"type": "palette", "name": "patterns", "colors": {"{_}": "#00000000", "{o}": "#000000", "{a}": "#FF0000", "{b}": "#00FF00", "{c}": "#0000FF"}}
{"type": "sprite", "name": "symmetric", "size": [5, 3], "palette": "patterns", "grid": ["{a}{b}{c}{b}{a}", "{a}{b}{c}{b}{a}", "{a}{b}{c}{b}{a}"]}
{"type": "sprite", "name": "outlined", "size": [4, 4], "palette": "patterns", "grid": ["{o}{o}{o}{o}", "{o}{a}{a}{o}", "{o}{a}{a}{o}", "{o}{o}{o}{o}"]}
{"type": "sprite", "name": "gradient", "size": [4, 1], "palette": "patterns", "grid": ["{a}{b}{c}{_}"]}
```
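Outline and transparency-border detection both reduce to checking the outer ring of the grid against a token. A hedged sketch, assuming a tokenized grid and a hypothetical helper `border_is`; the real detector would also need to decide which tokens count as "similar" to `{outline}`.

```rust
/// True when every cell on the outer ring of the grid equals `token`.
/// Returns false for an empty grid or a grid with an empty row.
pub fn border_is(grid: &[Vec<&str>], token: &str) -> bool {
    let (Some(first), Some(last)) = (grid.first(), grid.last()) else {
        return false;
    };
    // Top and bottom rows must consist entirely of the token.
    let top_bottom = first.iter().chain(last.iter()).all(|t| *t == token);
    // Every row must start and end with the token.
    let sides = grid
        .iter()
        .all(|row| row.first() == Some(&token) && row.last() == Some(&token));
    top_bottom && sides
}
```

With the `outlined` fixture above, `border_is(&grid, "{o}")` holds while `border_is(&grid, "{_}")` does not, so the sprite would count as outlined but not as having a transparency border.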
**Dependencies:** Task 14.1
---
### Task 14.4: Dimensional Analysis
**Wave:** 2 (parallel with 14.2, 14.3)
Track sprite sizes and aspect ratios.
**Deliverables:**
- Add to `src/analyze.rs`:
```rust
pub struct DimensionalAnalyzer {
size_counts: HashMap<(u32, u32), usize>,
}
impl DimensionalAnalyzer {
pub fn new() -> Self
pub fn add_sprite(&mut self, sprite: &Sprite)
pub fn top_sizes(&self, n: usize) -> Vec<((u32, u32), usize)>
pub fn aspect_ratio_distribution(&self) -> HashMap<AspectRatio, usize>
pub fn size_categories(&self) -> SizeCategories
}
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum AspectRatio {
Square,
Portrait,
Landscape,
}
#[derive(Debug, Default)]
pub struct SizeCategories {
pub tiny: usize,
pub small: usize,
pub medium: usize,
pub large: usize,
pub huge: usize,
}
```
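Aspect-ratio bucketing could look like the sketch below. It assumes sizes are stored as `(width, height)`, consistent with the `"size": [w, h]` fixtures, and that any non-square sprite is simply Portrait or Landscape. The `classify` helper and the free-function form of `aspect_ratio_distribution` are illustrative.

```rust
use std::cmp::Ordering;
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum AspectRatio {
    Square,
    Portrait,
    Landscape,
}

/// Classify a sprite by comparing width against height.
pub fn classify(width: u32, height: u32) -> AspectRatio {
    match width.cmp(&height) {
        Ordering::Equal => AspectRatio::Square,
        Ordering::Less => AspectRatio::Portrait, // taller than wide
        Ordering::Greater => AspectRatio::Landscape, // wider than tall
    }
}

/// Fold per-size sprite counts into per-aspect-ratio counts.
pub fn aspect_ratio_distribution(
    size_counts: &HashMap<(u32, u32), usize>,
) -> HashMap<AspectRatio, usize> {
    let mut dist = HashMap::new();
    for (&(w, h), &count) in size_counts {
        *dist.entry(classify(w, h)).or_insert(0) += count;
    }
    dist
}
```

Thresholds for `SizeCategories` are not given in this document, so they are left out of the sketch.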
**Verification:**
```bash
cargo test analyze::dimensional
# Test: Single sprite → correct size recorded
# Test: Multiple sizes → distribution correct
# Test: Aspect ratios categorized correctly
```
**Dependencies:** Task 14.1
---
### Task 14.5: Compression Estimation
**Wave:** 3 (after 14.2)
Calculate potential compression savings from run-length encoding.
**Deliverables:**
- Add to `src/analyze.rs`:
```rust
pub struct CompressionEstimator;
impl CompressionEstimator {
pub fn count_runs(row: &[String]) -> usize
pub fn estimate_rle_savings(sprite: &Sprite) -> RleSavings
pub fn find_repeated_rows(sprite: &Sprite) -> Vec<(usize, usize)>
pub fn analyze_corpus(sprites: &[Sprite]) -> CompressionStats
}
#[derive(Debug)]
pub struct RleSavings {
pub original_tokens: usize,
pub compressed_tokens: usize,
pub ratio: f64,
}
```
- Metrics to calculate:
- Average tokens per row
- Average unique tokens per row
- Average runs per row (consecutive identical tokens)
- Potential compression ratio (original / RLE encoded)
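The run count and compression ratio listed above can be sketched as follows. The cost model is an assumption: each run is charged as one token, ignoring whatever the run-length marker itself would cost in a real encoding. `estimate_rle_savings` takes tokenized rows here instead of a `Sprite`, since the `Sprite` type is defined elsewhere.

```rust
/// Number of runs (maximal stretches of identical consecutive tokens) in a row.
pub fn count_runs(row: &[String]) -> usize {
    if row.is_empty() {
        return 0;
    }
    // One run to start with, plus one for every change between neighbours.
    1 + row.windows(2).filter(|pair| pair[0] != pair[1]).count()
}

#[derive(Debug)]
pub struct RleSavings {
    pub original_tokens: usize,
    pub compressed_tokens: usize,
    pub ratio: f64,
}

/// Estimate savings assuming each run costs one token after encoding.
pub fn estimate_rle_savings(rows: &[Vec<String>]) -> RleSavings {
    let original_tokens: usize = rows.iter().map(Vec::len).sum();
    let compressed_tokens: usize = rows.iter().map(|row| count_runs(row)).sum();
    let ratio = if compressed_tokens == 0 {
        1.0 // nothing to compress
    } else {
        original_tokens as f64 / compressed_tokens as f64
    };
    RleSavings {
        original_tokens,
        compressed_tokens,
        ratio,
    }
}
```

A row of identical tokens yields a single run, while a row of all-distinct tokens yields as many runs as tokens and therefore a ratio of 1.0.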
**Verification:**
```bash
cargo test analyze::compression
# Test: Row of identical tokens → high compression
# Test: Row of unique tokens → no compression
# Test: Mixed row → accurate run count
```
**Test Fixture:** `tests/fixtures/valid/analyze_compression.jsonl`
```jsonl
{"type": "palette", "name": "compress", "colors": {"{_}": "#00000000", "{a}": "#FF0000"}}
{"type": "sprite", "name": "compressible", "size": [16, 4], "palette": "compress", "grid": [
"{_}{_}{_}{_}{_}{_}{_}{_}{_}{_}{_}{_}{_}{_}{_}{_}",
"{_}{_}{_}{_}{a}{a}{a}{a}{a}{a}{a}{a}{_}{_}{_}{_}",
"{_}{_}{_}{_}{a}{a}{a}{a}{a}{a}{a}{a}{_}{_}{_}{_}",
"{_}{_}{_}{_}{_}{_}{_}{_}{_}{_}{_}{_}{_}{_}{_}{_}"
]}
```
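Row repetition in the fixture above (rows 0 and 3 are identical, as are rows 1 and 2) could be detected with a sketch like this one. The meaning of the returned pairs is an assumption: `(first_index, duplicate_index)` for every row that repeats an earlier row. Rows are compared as raw strings, and the `&[&str]` parameter stands in for the `Sprite` argument in the spec.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// For each row that duplicates an earlier row, return
/// (index of first occurrence, index of the duplicate).
pub fn find_repeated_rows(grid: &[&str]) -> Vec<(usize, usize)> {
    let mut first_seen: HashMap<&str, usize> = HashMap::new();
    let mut repeats = Vec::new();
    for (i, row) in grid.iter().enumerate() {
        match first_seen.entry(*row) {
            Entry::Occupied(seen) => repeats.push((*seen.get(), i)),
            Entry::Vacant(slot) => {
                slot.insert(i);
            }
        }
    }
    repeats
}
```

For the `compressible` sprite this yields `[(1, 2), (0, 3)]`. The same sprite has 64 tokens in 8 runs (1 + 3 + 3 + 1), so a one-token-per-run model would estimate an 8x ratio.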
**Dependencies:** Task 14.2
---
### Task 14.6: Output Formatting & CLI Polish
**Wave:** 4 (after all above)
Complete output formatting and CLI integration.
**Deliverables:**
- Text output format (default):
```
Pixelsrc Analysis Report
========================
Files analyzed: 47
Total sprites: 156
Total palettes: 89
Total compositions: 12
TOKEN FREQUENCY (top 10)
────────────────────────
{_} 45,231 (28.4%)
{outline} 12,847 (8.1%)
{skin} 8,392 (5.3%)
...
TOKEN CO-OCCURRENCE (top 5 pairs)
─────────────────────────────────
{skin} + {skin_shadow} 67 sprites
...
DIMENSIONS
──────────
16x16 67 sprites (43%)
8x8 42 sprites (27%)
...
STRUCTURAL PATTERNS
───────────────────
Has outline: 89 sprites (57%)
Horizontal sym: 34 sprites (22%)
...
RUN-LENGTH OPPORTUNITIES
────────────────────────
Avg tokens/row: 14.3
Potential compression: ~2.5x
```
- JSON output format (`--format json`):
```rust
impl AnalysisReport {
pub fn to_json(&self) -> serde_json::Value
}
```
```json
{
"files_analyzed": 47,
"total_sprites": 156,
"token_frequency": [{"token": "{_}", "count": 45231, "percentage": 28.4}, ...],
"dimensions": [{"size": [16, 16], "count": 67, "percentage": 43}, ...],
...
}
```
- File output (`--output <file>`):
```rust
fn write_report(report: &AnalysisReport, path: &Path, format: OutputFormat) -> Result<()>
```
- Progress indication for large corpora (stderr):
```
Analyzing 100 files... [=====> ] 50%
```
- Error handling:
- Skip invalid files with warning
- Continue on parse errors
- Report count of skipped files
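The skip-and-continue behaviour in the error-handling list could be structured like the sketch below. `RunSummary` and `analyze_all` are hypothetical names, and the per-file work is passed in as a closure so the example stays independent of the real parser.

```rust
use std::path::{Path, PathBuf};

#[derive(Debug, Default)]
pub struct RunSummary {
    pub files_analyzed: usize,
    /// Files that failed, with the reason reported for each.
    pub skipped: Vec<(PathBuf, String)>,
}

/// Run `analyze_one` over every file, warning on stderr and moving on
/// when a file fails instead of aborting the whole run.
pub fn analyze_all<F>(files: &[PathBuf], mut analyze_one: F) -> RunSummary
where
    F: FnMut(&Path) -> Result<(), String>,
{
    let mut summary = RunSummary::default();
    for file in files {
        match analyze_one(file.as_path()) {
            Ok(()) => summary.files_analyzed += 1,
            Err(reason) => {
                eprintln!("warning: skipping {}: {}", file.display(), reason);
                summary.skipped.push((file.clone(), reason));
            }
        }
    }
    if !summary.skipped.is_empty() {
        eprintln!("{} file(s) skipped", summary.skipped.len());
    }
    summary
}
```

Writing warnings to stderr keeps them out of the report itself, which matters when `--format json` output is piped to another tool.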
**Verification:**
```bash
# Text output
./target/release/pxl analyze examples/
# Should show formatted report
# JSON output
./target/release/pxl analyze examples/ --format json
# Should print valid JSON
# File output
./target/release/pxl analyze examples/ --output /tmp/report.txt
cat /tmp/report.txt
# JSON file output
./target/release/pxl analyze examples/ --format json --output /tmp/report.json
jq '.' /tmp/report.json
```
**Dependencies:** Tasks 14.2, 14.3, 14.4, 14.5
---
## Output Example (text format)
```
Pixelsrc Analysis Report
========================
Files analyzed: 47
Total sprites: 156
Total palettes: 89
Total compositions: 12
TOKEN FREQUENCY (top 10)
────────────────────────
{_} 45,231 (28.4%)
{outline} 12,847 (8.1%)
{skin} 8,392 (5.3%)
{shadow} 7,104 (4.5%)
{hair} 5,231 (3.3%)
{white} 4,892 (3.1%)
{black} 4,103 (2.6%)
{eye} 3,847 (2.4%)
{gold} 3,102 (2.0%)
{shine} 2,891 (1.8%)
TOKEN CO-OCCURRENCE (top 5 pairs)
─────────────────────────────────
{skin} + {skin_shadow} 67 sprites
{outline} + {_} 64 sprites
{gold} + {shine} 23 sprites
{hair} + {hair_dark} 21 sprites
{eye} + {pupil} 19 sprites
DIMENSIONS
──────────
16x16 67 sprites (43%)
8x8 42 sprites (27%)
32x32 28 sprites (18%)
24x24 11 sprites (7%)
Other 8 sprites (5%)
STRUCTURAL PATTERNS
───────────────────
Has outline: 89 sprites (57%)
Horizontal sym: 34 sprites (22%)
Vertical sym: 12 sprites (8%)
Uses gradients: 45 sprites (29%)
Transparency border: 78 sprites (50%)
RUN-LENGTH OPPORTUNITIES
────────────────────────
Avg tokens/row: 14.3
Avg unique tokens/row: 4.2
Avg runs/row: 5.8
Potential compression: ~2.5x
PALETTE PATTERNS
────────────────
Avg tokens/palette: 8.3
Uses built-in palette: 23 sprites (15%)
Has highlight/shadow: 67 sprites (43%)
Inline palette: 34 sprites (22%)
```
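The count and percentage columns in the report above could be produced with small helpers like these. The function names and column widths are assumptions chosen for illustration; only the standard library is used.

```rust
/// Format an integer with comma thousands separators, e.g. 45231 -> "45,231".
pub fn with_thousands(n: usize) -> String {
    let digits = n.to_string();
    let mut out = String::with_capacity(digits.len() + digits.len() / 3);
    for (i, ch) in digits.chars().enumerate() {
        // Insert a comma whenever the remaining digit count is a multiple of 3.
        if i > 0 && (digits.len() - i) % 3 == 0 {
            out.push(',');
        }
        out.push(ch);
    }
    out
}

/// One line of the TOKEN FREQUENCY table: token, count, share of all tokens.
pub fn frequency_line(token: &str, count: usize, total: usize) -> String {
    let pct = if total == 0 {
        0.0
    } else {
        100.0 * count as f64 / total as f64
    };
    format!("{:<12}{:>8}  ({:.1}%)", token, with_thousands(count), pct)
}
```

Guarding against `total == 0` avoids printing `NaN%` when the corpus is empty.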
---
## Verification Summary
```bash
# 1. All previous tests pass
cargo test
# 2. CLI help works
./target/release/pxl analyze --help
# 3. Analyze examples directory
./target/release/pxl analyze examples/
# Should produce meaningful report
# 4. JSON output is valid
./target/release/pxl analyze examples/ --format json | jq '.'
# 5. Recursive works
./target/release/pxl analyze --recursive tests/fixtures/
# Should find all .jsonl files
# 6. File output works
./target/release/pxl analyze examples/ --output /tmp/report.txt
cat /tmp/report.txt
# 7. Performance test
time ./target/release/pxl analyze --recursive examples/
# Should complete in < 5s for typical corpus
```
---
## Success Criteria
1. `pxl analyze examples/` produces a meaningful report
2. JSON output is machine-parseable for scripting
3. Compression estimates are accurate (validated against manual calculation)
4. Analysis completes in reasonable time (<5s for 100 files)
5. Report provides actionable insights for primitive design
---
## Future Extensions
Not in scope for Phase 14, but potential additions:
| Feature | Description |
|---------|-------------|
| `--compare <baseline.json>` | Diff against previous analysis |
| `--suggest-primitives` | Recommend common patterns as primitives |
| `--visualize` | Output charts (SVG or ASCII) |
| `--watch` | Re-analyze on file changes |
| Token clustering | Group semantically similar tokens |