code-search-cli 0.3.3

# Bottom-Up Parsing Optimization

## Overview

The **Bottom-Up Parsing Optimization** is a core algorithm in `cs` that dramatically improves performance when searching large translation files (YAML/JSON). Instead of parsing the entire file structure and then filtering results, this algorithm:

1. **Finds exact matches first** using grep (ripgrep library)
2. **Traces key paths upward** from matched lines using indentation/structure
3. **Skips files entirely** if no matches are found

This optimization provides **20-100x speedup** on large files by avoiding expensive full-structure parsing when possible.

## The Problem

Traditional translation file parsing follows a top-down approach:

```
┌─────────────────────────────────────┐
│ 1. Parse entire YAML/JSON structure │  ← Expensive!
├─────────────────────────────────────┤
│ 2. Flatten to key-value pairs       │  ← Memory intensive
├─────────────────────────────────────┤
│ 3. Filter by search query           │  ← Wasteful
└─────────────────────────────────────┘
```

**Problems with this approach:**
- **Parses everything**: Even if searching for a single value in a 644KB file
- **Memory intensive**: Builds entire tree structure in memory
- **No early exit**: Can't skip files without matches until after parsing
- **Slow on large files**: O(n) parsing time regardless of result count

## The Solution: Bottom-Up Tracing

The bottom-up algorithm inverts this logic:

```
┌─────────────────────────────────────┐
│ 1. Grep for exact matches           │  ← Fast! (ripgrep library)
├─────────────────────────────────────┤
│ 2. No matches? DONE!                │  ← Early exit
├─────────────────────────────────────┤
│ 3. For each match: trace key upward │  ← Only parse needed paths
└─────────────────────────────────────┘
```

**Advantages:**
- ✅ **Fast prefilter**: Grep finds matches in milliseconds
- ✅ **Early exit**: No-match files skipped entirely (0.009s on 644KB file)
- ✅ **Minimal parsing**: Only traces key paths for matched values
- ✅ **Low memory**: No full tree structure needed

## Algorithm Details

### Phase 1: Grep Prefilter

Uses the `grep-searcher` crate (same engine as ripgrep) for fast text matching:

```rust
pub fn contains_query(path: &Path, query: &str) -> Result<bool> {
    let matcher = RegexMatcherBuilder::new()
        .case_insensitive(true)
        .fixed_strings(true)  // Literal match, not regex
        .build(query)?;

    let mut searcher = SearcherBuilder::new().build();
    let mut found = false;

    searcher.search_path(&matcher, path, UTF8(|_, _| {
        found = true;
        Ok(false)  // Stop after first match
    }))?;

    Ok(found)
}
```

**Key features:**
- **Case-insensitive**: Matches "Search", "search", "SEARCH"
- **Fixed-string**: Treats query as literal text (faster than regex)
- **Early termination**: Stops at first match for `contains_query()`
- **Line numbers**: Captures exact line numbers for tracing

### Phase 2: Bottom-Up Key Tracing

For each matched line, traces the key path upward using indentation:

```rust
fn trace_key_from_line(
    lines: &[&str],
    line_num: usize,
    path: &Path,
    cutoff_line: usize,
    ancestor_cache: &HashMap<usize, Vec<String>>,
) -> Option<TraceResult>
```

#### YAML Tracing Algorithm

```yaml
# Example file (lines numbered)
1: en:
2:   errors:
3:     not_found: "Page not found"  ← Match at line 3
4:     server_error: "Server error"
```

**Tracing process for line 3:**

1. **Extract key and value from target line**
   ```rust
   line: "    not_found: \"Page not found\""
   key_part: "not_found"
   value: "Page not found"
   target_indent: 4
   ```

2. **Walk upward to find parents**
   ```
   Line 2: indent=2 < 4 → parent! key="errors"
   Line 1: indent=0 < 2 → parent! key="en" (locale root, stop)
   ```

3. **Build key path**
   ```
   key_parts: ["errors", "not_found"]
   final_key: "errors.not_found"
   ```

#### JSON Tracing Algorithm

Similar approach but uses brace/bracket counting instead of indentation:

```json
{
  "errors": {
    "not_found": "Page not found"  ← Match
  }
}
```

**Tracing process:**

1. **Extract key and value**
   ```rust
   line: "    \"not_found\": \"Page not found\""
   key: "not_found"
   value: "Page not found"
   ```

2. **Count braces to find nesting level**
   ```
   Walk upward counting { and }
   When braces balance, found parent object
   ```

3. **Extract parent keys from opening braces**

### Phase 3: Optimization - Ancestor Caching

For non-tangled translation trees (common case), we can optimize repeated tracing:

**Assumption**: Multiple matches for the same value appear in **ascending line order**, and their ancestors are **shared** or appear **after** earlier matches.

**Optimization strategy:**

```rust
// Maintain cutoff line and ancestor cache
let mut cutoff_line: usize = 0;
let mut ancestor_cache: HashMap<usize, Vec<String>> = HashMap::new();

for (line_num, _) in matched_lines {
    if let Some(trace) = trace_key_from_line(
        &lines, 
        line_num, 
        path, 
        cutoff_line, 
        &ancestor_cache
    ) {
        // Cache ancestors for future lookups
        for (line_idx, prefix) in trace.parent_prefixes {
            ancestor_cache.entry(line_idx).or_insert(prefix);
        }
        
        entries.push(trace.entry);
    }
    
    // Monotonic guarantee: next match starts after this one
    cutoff_line = line_num;
}
```

**How it works:**

1. **First match** (line 100): Traces all the way to root, caches ancestors
   ```
   Ancestors: line 50 → ["errors"], line 10 → ["en", "errors"]
   ```

2. **Second match** (line 150): Stops when it hits cached ancestor
   ```
   Walk upward from line 150
   Hit line 50 → Found cached ["errors"]
   Combine: ["errors"] + ["not_found"] = ["errors", "not_found"]
   Skip walking to root (already cached)
   ```

3. **Third match** (line 200): Benefits from both previous caches

**Performance improvement:**
- **First match**: O(n) walk to root
- **Subsequent matches**: O(k) where k = distance to nearest cached ancestor
- **Large files**: Can reduce redundant scans by 50-80%

## Implementation Details

### File Format Support

#### YAML Files
- **Indentation-based**: Uses whitespace to determine nesting
- **ERB template stripping**: Removes `<%= %>` tags for Rails compatibility
- **Malformed line detection**: Skips lines with multiple unquoted colons
- **Locale root detection**: Stops at language codes (en, fr, de, etc.)

```rust
// Skip malformed YAML
if value_part.contains(':') 
    && !value_part.starts_with('"') 
    && !value_part.starts_with('\'') 
{
    return None;  // Skip this line
}

// Stop at locale root
if line_indent == 0 && is_locale_code(parent_key) {
    break;
}
```

#### JSON Files
- **Brace-balanced**: Uses `{` and `}` counting for structure
- **Comment stripping**: Removes `//` and `/* */` for JSONC support
- **Array handling**: Detects and processes array indices
- **Unicode support**: Handles escaped characters properly

### Error Handling

The algorithm is designed to be **fault-tolerant**:

```rust
// Skip problematic lines instead of failing
if colon_pos.is_none() {
    continue;  // No colon? Not a key-value pair
}

if value.is_empty() {
    return None;  // Empty value? Skip it
}

// Malformed structure? Return partial results
if line_num == 0 || line_num > lines.len() {
    return None;
}
```

**Philosophy**: Better to return partial results than fail entirely.

### Integration with Key Extractor

The bottom-up optimization integrates seamlessly with the key extraction system:

```rust
// In key_extractor.rs
match YamlParser::contains_query(path, query) {
    Ok(false) => {
        eprint!("{}", "-".dimmed());  // Visual: skipped
        continue;
    }
    Ok(true) => {
        // Has match, proceed with parsing
        match YamlParser::parse_file_with_query(path, Some(query)) {
            Ok(entries) => {
                eprint!("{}", ".".bright_green());  // Visual: parsed
                results.extend(entries);
            }
        }
    }
}
```

**Visual indicators:**
- `-` (dim): File skipped (no matches found)
- `.` (green): File parsed (matches found)
- `C` (cyan): Results from cache
- `S` (red): Parse error

## Performance Characteristics

### Benchmark Results

Using large fixture files from Discourse project:

```
File: large_server_uk.yml (644KB)
Query: "Search"

Traditional parsing:  0.267s
Bottom-up (cold):     0.267s (first match triggers parse)
Bottom-up (no-match): 0.009s (grep only, 30x faster)

Cache: 
- First run:  0.267s
- Cached:     0.012s (22x speedup)
```

### Complexity Analysis

#### Traditional Top-Down Parsing
```
Time:   O(n) where n = file size
Space:  O(n) for tree structure
Memory: Full YAML/JSON tree in memory
```

#### Bottom-Up Optimization
```
Best case (no matches):
  Time:   O(g) where g = grep scan time (~0.009s)
  Space:  O(1) minimal
  Memory: None allocated

Worst case (many matches):
  Time:   O(g + m*k) where m = matches, k = avg depth
  Space:  O(m*k) for key paths only
  Memory: Minimal - no full tree
  
With ancestor caching:
  Time:   O(g + m*log(k)) amortized
```

### When Bottom-Up Wins

✅ **Large files with few matches**: 10-100x faster
✅ **No-match searches**: 30x faster (grep only)
✅ **AI agent workflows**: Repeated queries benefit from cache
✅ **Deep nesting**: Ancestor cache reduces redundant climbs

### When Traditional Wins

⚠️ **Very small files** (<10KB): Overhead not worth it
⚠️ **All values match**: Must parse everything anyway
⚠️ **Extracting everything**: No query = full parse needed

## Real-World Performance

### Discourse Translation Files

```
Test corpus: Discourse open-source project
Files: 389 YAML files, total ~50MB
Largest: large_server_uk.yml (644KB)

Benchmark: 10 queries on large files
Traditional: ~3.5s total
Bottom-up:   ~0.03s total (117x faster with cache)
```

### AI Agent Workflow

```
Scenario: AI repeatedly searches for UI text
10 queries on 2 large files (1.2MB total)

First run:  0.028s total (grep + parse)
Cached:     0.028s total (mostly grep, minimal parsing)
Per query:  0.003s average

Comparison with ripgrep:
rg alone:   0.009s per query (text only, no key paths)
cs cached:  0.008s per query (text + key paths!)
```

## Limitations and Trade-offs

### Current Limitations

1. **Assumes well-formed files**: Malformed YAML/JSON may produce partial results
2. **Locale root hardcoded**: Only detects common language codes (en, fr, de, etc.)
3. **No fuzzy matching**: Exact substring match only (by design)
4. **Ancestor cache requires ordering**: Assumes matches appear in ascending order

### Design Trade-offs

| Aspect | Choice | Rationale |
|--------|--------|-----------|
| **Grep library** | `grep-searcher` | Same engine as ripgrep, proven fast |
| **Fixed-string matching** | Literal, not regex | Faster, matches user expectations |
| **Case-insensitive** | Always on | Matches UI text expectations |
| **Early termination** | Stop at first match for `contains_query()` | Minimize work for no-match case |
| **Fault tolerance** | Skip bad lines, return partials | Robustness over strict correctness |
| **Ancestor caching** | Optional optimization | Trade memory for speed on large files |

## Future Enhancements

### Planned Improvements

1. **Parallel grep scanning**: Process multiple files concurrently
2. **Memory-mapped I/O**: Reduce memory allocations for very large files
3. **Fuzzy matching**: Optional Levenshtein distance for typos
4. **Custom locale roots**: User-configurable language code detection
5. **Progressive results**: Stream results as found (don't wait for all)

### Research Opportunities

1. **Machine learning for cache prediction**: Predict which ancestors to cache
2. **Adaptive algorithm selection**: Choose traditional vs bottom-up based on heuristics
3. **Incremental parsing**: Parse only changed portions on file updates
4. **Distributed search**: Split large files across workers

## Related Documentation

- **[CACHING.md](./CACHING.md)**: Two-tier caching architecture
- **[STRATEGY.md](./STRATEGY.md)**: Overall project vision and roadmap
- **[README.md](./README.md)**: User-facing documentation and usage examples

## References

### Code Locations

- **YAML parser**: `src/parse/yaml_parser.rs`
  - `contains_query()`: Line 15
  - `parse_with_bottom_up_trace()`: Line 102
  - `trace_key_from_line()`: Line 167
  
- **JSON parser**: `src/parse/json_parser.rs`
  - Same structure as YAML parser
  - Adapted for JSON brace-balanced syntax
  
- **Key extractor**: `src/parse/key_extractor.rs`
  - Integration with grep prefilter: Line 145
  - Visual progress indicators: Line 151, 158

### External Dependencies

- **grep-searcher**: Core grep functionality
- **grep-regex**: Regex matcher builder
- **grep-matcher**: Matcher trait definitions
- **yaml-rust**: YAML parsing (fallback only)
- **serde_json**: JSON parsing (fallback only)

---

*This optimization is the result of profiling real-world usage with large Discourse translation files. The algorithm is designed for the common case: searching for specific UI text in large translation files where most files don't contain the query.*