pith 0.1.0

Generate optimized codebase context for LLMs
# Filter Specification

## Purpose

Determine which files should be processed for codemap extraction using a layered filtering approach.

## Filtering Layers

```
     ┌─────────────────────────────────────────┐
     │ 1. Extension blocklist (instant skip)   │
     │    .png .jpg .wasm .so .exe .zip .lock  │
     ├─────────────────────────────────────────┤
     │ 2. Extension allowlist (attempt parse)  │
     │    .rs .ts .tsx .js .jsx .py .go        │
     ├─────────────────────────────────────────┤
     │ 3. Heuristics (before parsing)          │
     │    - Null bytes in first 1KB → binary   │
     │    - Any line > 500 chars → minified    │
     │    - Contains generated markers → skip  │
     └─────────────────────────────────────────┘
```

## Types

### FilterResult

```rust
pub enum FilterResult {
    Accept(Language),
    Reject(RejectReason),
}

pub enum RejectReason {
    BlocklistedExtension,
    UnknownExtension,
    BinaryContent,
    MinifiedContent,
    GeneratedFile,
}
```

### FilterOptions

```rust
pub struct FilterOptions {
    pub additional_extensions: HashMap<String, Language>,
    pub skip_heuristics: bool,
}
```

## Functions

### should_process

```rust
pub fn should_process(path: &Path, content: Option<&[u8]>) -> FilterResult
```

**Preconditions:**
- `path` is a file path
- `content` is the first 1KB of the file (used by the heuristics), or `None` to skip the heuristics

**Postconditions:**
- Returns `Accept(language)` if file should be processed
- Returns `Reject(reason)` otherwise

### is_blocklisted

```rust
pub fn is_blocklisted(extension: &str) -> bool
```

**Postconditions:**
- Returns `true` if extension is in blocklist

### detect_language

```rust
pub fn detect_language(path: &Path) -> Option<Language>
```

**Postconditions:**
- Returns `Some(language)` if extension maps to supported language
- Returns `None` otherwise

### is_binary

```rust
pub fn is_binary(content: &[u8]) -> bool
```

**Postconditions:**
- Returns `true` if content contains null bytes

### is_minified

```rust
pub fn is_minified(content: &[u8]) -> bool
```

**Postconditions:**
- Returns `true` if any line in `content` is longer than 500 bytes (500 characters for ASCII input)

### is_generated

```rust
pub fn is_generated(content: &[u8]) -> bool
```

**Postconditions:**
- Returns `true` if content contains generated file markers

## Extension Blocklist

Extensions and file names that are always rejected:

### Binary Formats
```
.png .jpg .jpeg .gif .webp .ico .svg
.wasm .so .dll .dylib .exe .bin
.zip .tar .gz .bz2 .xz .7z .rar
.pdf .doc .docx .xls .xlsx
.mp3 .mp4 .wav .avi .mov
.ttf .otf .woff .woff2 .eot
```

### Lock Files
```
.lock
package-lock.json
yarn.lock
Cargo.lock
poetry.lock
Gemfile.lock
```

### Other
```
.min.js .min.css
.map
.d.ts (type declarations; may optionally be included)
```
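
A minimal sketch of how the blocklist could be checked, assuming only a subset of the entries above. `is_blocklisted` follows the signature from the Functions section; `is_blocklisted_path` is a hypothetical helper, not part of this spec, added because `Path::extension` returns only the last component and so cannot see compound entries such as `.min.js` or whole file names such as `package-lock.json`.

```rust
use std::path::Path;

// Subset of the blocklist above; a full table would list every entry.
const BLOCKED_EXTENSIONS: &[&str] = &[
    "png", "jpg", "jpeg", "gif", "wasm", "so", "dll", "exe", "zip", "gz", "pdf", "lock", "map",
];

// Compound suffixes that `Path::extension` cannot see (it yields only "js" or "css").
const BLOCKED_SUFFIXES: &[&str] = &[".min.js", ".min.css"];

// Lock files from the list above whose names do not end in `.lock`.
const BLOCKED_FILE_NAMES: &[&str] = &["package-lock.json"];

/// Spec signature: `extension` is passed without the leading dot.
pub fn is_blocklisted(extension: &str) -> bool {
    let ext = extension.to_ascii_lowercase();
    BLOCKED_EXTENSIONS.iter().any(|blocked| *blocked == ext.as_str())
}

/// Hypothetical helper (an assumption, not in the spec): checks the whole file name.
pub fn is_blocklisted_path(path: &Path) -> bool {
    let name = match path.file_name().and_then(|n| n.to_str()) {
        Some(n) => n.to_ascii_lowercase(),
        None => return false,
    };
    if BLOCKED_FILE_NAMES.iter().any(|blocked| *blocked == name.as_str())
        || BLOCKED_SUFFIXES.iter().any(|suffix| name.ends_with(*suffix))
    {
        return true;
    }
    // Fall back to the plain extension check.
    path.extension()
        .and_then(|e| e.to_str())
        .map_or(false, is_blocklisted)
}
```

Keeping the three tables separate makes it explicit which entries are matched by extension, by suffix, and by exact name.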

## Extension Allowlist

Extensions that are processed, mapped to languages:

| Extension | Language |
|-----------|----------|
| `.rs` | Rust |
| `.ts` | TypeScript |
| `.tsx` | Tsx |
| `.js` | JavaScript |
| `.jsx` | Jsx |
| `.mjs` | JavaScript |
| `.cjs` | JavaScript |
| `.py` | Python |
| `.pyi` | Python |
| `.go` | Go |
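
The table above could be implemented as a single `match`, sketched below. The `Language` enum is not defined in this document, so its variants here are assumed from the table's right-hand column.

```rust
use std::path::Path;

// Assumed definition: variants are taken from the allowlist table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Language {
    Rust,
    TypeScript,
    Tsx,
    JavaScript,
    Jsx,
    Python,
    Go,
}

pub fn detect_language(path: &Path) -> Option<Language> {
    // Only the last extension is considered, compared case-insensitively.
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "rs" => Some(Language::Rust),
        "ts" => Some(Language::TypeScript),
        "tsx" => Some(Language::Tsx),
        "js" | "mjs" | "cjs" => Some(Language::JavaScript),
        "jsx" => Some(Language::Jsx),
        "py" | "pyi" => Some(Language::Python),
        "go" => Some(Language::Go),
        _ => None,
    }
}
```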

## Heuristics

### Binary Detection

Check first 1KB for null bytes (0x00):

```rust
pub fn is_binary(content: &[u8]) -> bool {
    content.iter().any(|&b| b == 0)
}
```

**Rationale:** Text files should not contain null bytes. This catches accidentally included binary files.
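
The following sketch illustrates the check on a few inputs. `is_binary_prefix` is a hypothetical helper, not part of this spec, that applies the 1KB limit when the caller holds more than the prefix. One consequence worth noting: UTF-16 encoded text with ASCII-range characters contains null bytes and would be classified as binary by this heuristic.

```rust
pub fn is_binary(content: &[u8]) -> bool {
    content.iter().any(|&b| b == 0)
}

/// Hypothetical helper (an assumption): restricts the check to the first 1KB.
pub fn is_binary_prefix(content: &[u8]) -> bool {
    let end = content.len().min(1024);
    is_binary(&content[..end])
}

/// Sample inputs used to illustrate the heuristic.
pub fn samples() -> (Vec<u8>, Vec<u8>, Vec<u8>) {
    // UTF-8 source with a byte order mark: no null bytes.
    let bom_source = b"\xEF\xBB\xBFfn main() {}".to_vec();
    // Start of an ELF header: contains a null byte.
    let elf_header = b"\x7fELF\x02\x01\x01\x00".to_vec();
    // "hi" in UTF-16LE with a BOM: every ASCII character is followed by 0x00.
    let utf16_text = vec![0xFF, 0xFE, b'h', 0x00, b'i', 0x00];
    (bom_source, elf_header, utf16_text)
}
```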

### Minified Detection

Check for lines longer than 500 bytes (500 characters for ASCII input):

```rust
pub fn is_minified(content: &[u8]) -> bool {
    content
        .split(|&b| b == b'\n')
        .any(|line| line.len() > 500)
}
```

**Rationale:** Minified JavaScript/CSS has very long lines. These files are not useful for codemaps.
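
Because `line.len()` on a byte slice counts bytes, a line of non-ASCII text can cross the threshold with fewer than 500 characters. If a character-based limit is wanted, one possible variant is sketched below; `is_minified_chars` is a hypothetical name and not part of this spec.

```rust
/// Byte-based check, as in the implementation above.
pub fn is_minified(content: &[u8]) -> bool {
    content
        .split(|&b| b == b'\n')
        .any(|line| line.len() > 500)
}

/// Hypothetical character-based variant (an assumption, not in the spec).
/// Invalid UTF-8 sequences are replaced with U+FFFD and counted as characters.
pub fn is_minified_chars(content: &[u8]) -> bool {
    let text = String::from_utf8_lossy(content);
    text.lines().any(|line| line.chars().count() > 500)
}
```

For example, a single line of 300 `é` characters occupies 600 bytes, so the byte-based check flags it while the character-based variant does not.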

### Generated File Detection

Check for common generated file markers:

```rust
const GENERATED_MARKERS: &[&str] = &[
    "// Code generated",
    "// DO NOT EDIT",
    "# Generated by",
    "/* Auto-generated */",
    "// This file is auto-generated",
    "@generated",
    "// generated from",
];

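pub fn is_generated(content: &[u8]) -> bool {
    // Decode lossily: the 1KB prefix may end in the middle of a multi-byte
    // character, which would make a strict `from_utf8` fail and hide every marker.
    let text = String::from_utf8_lossy(content);
    GENERATED_MARKERS.iter().any(|marker| text.contains(*marker))
}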
```

**Rationale:** Generated files (protobuf outputs, codegen) don't need codemaps.

## Full Algorithm

```rust
pub fn should_process(path: &Path, content: Option<&[u8]>) -> FilterResult {
    // Get extension
    let ext = path.extension()
        .and_then(|e| e.to_str())
        .map(|e| e.to_lowercase());

    let ext = match ext {
        Some(e) => e,
        None => return FilterResult::Reject(RejectReason::UnknownExtension),
    };

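    // Layer 1: Blocklist
    // `path.extension()` yields only the last component ("js" for `bundle.min.js`),
    // so compound entries such as `.min.js` are matched against the file name.
    let file_name = path.file_name()
        .and_then(|n| n.to_str())
        .map(|n| n.to_lowercase())
        .unwrap_or_default();
    let compound_blocked = [".min.js", ".min.css"]
        .iter()
        .any(|suffix| file_name.ends_with(*suffix));
    if compound_blocked || is_blocklisted(&ext) {
        return FilterResult::Reject(RejectReason::BlocklistedExtension);
    }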

    // Layer 2: Allowlist
    let language = match detect_language(path) {
        Some(lang) => lang,
        None => return FilterResult::Reject(RejectReason::UnknownExtension),
    };

    // Layer 3: Heuristics (if content provided)
    if let Some(content) = content {
        if is_binary(content) {
            return FilterResult::Reject(RejectReason::BinaryContent);
        }

        if is_minified(content) {
            return FilterResult::Reject(RejectReason::MinifiedContent);
        }

        if is_generated(content) {
            return FilterResult::Reject(RejectReason::GeneratedFile);
        }
    }

    FilterResult::Accept(language)
}
```

## Edge Cases

### No Extension
Files without extensions are rejected with `UnknownExtension`.

### Multiple Extensions
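Language detection uses only the last extension: `file.test.ts` → `.ts` → TypeScript. Compound blocklist entries (`.min.js`, `.min.css`) are the exception and are matched against the end of the file name.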

### Case Sensitivity
Extensions are compared case-insensitively: `.RS` = `.rs`

### Empty Files
Empty files pass heuristics (no binary content, no long lines, no markers).

### UTF-8 BOM
Files starting with a UTF-8 BOM (`EF BB BF`) are not considered binary, because the BOM contains no null bytes.

### Very Large Files
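Heuristics read only the first 1KB of the file. If that prefix passes, the full file is read for codemap extraction. Null bytes, long lines, or generated markers that appear after the first 1KB are therefore not detected.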
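
The walker integration in this document calls a `read_first_kb` helper that is not defined here. A minimal sketch, assuming it returns at most 1024 bytes as an `io::Result<Vec<u8>>`, could look like this:

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Assumed signature: reads at most the first 1024 bytes of the file.
pub fn read_first_kb(path: &Path) -> io::Result<Vec<u8>> {
    let file = File::open(path)?;
    let mut buf = Vec::with_capacity(1024);
    // `take` caps the reader, so large files are never loaded in full.
    file.take(1024).read_to_end(&mut buf)?;
    Ok(buf)
}
```

Files shorter than 1KB are returned whole, and an empty file yields an empty buffer.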

## Configuration

### Adding Custom Extensions

```rust
let options = FilterOptions {
    additional_extensions: [
        ("vue".to_string(), Language::TypeScript),
        ("svelte".to_string(), Language::TypeScript),
    ].into(),
    skip_heuristics: false,
};
```

### Skipping Heuristics

For trusted codebases, skip heuristic checks:

```rust
let options = FilterOptions {
    additional_extensions: HashMap::new(),
    skip_heuristics: true,
};
```
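
The `should_process` signature in this document does not take a `FilterOptions` argument, so how the options reach the filter is left open. One possible shape is sketched below: `should_process_with` is a hypothetical function, and the helper bodies and `Language` variants are reduced stand-ins so the example compiles on its own.

```rust
use std::collections::HashMap;
use std::path::Path;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Language { Rust, TypeScript, Python } // assumed subset

#[derive(Debug, PartialEq, Eq)]
pub enum RejectReason {
    BlocklistedExtension,
    UnknownExtension,
    BinaryContent,
    MinifiedContent,
    GeneratedFile,
}

#[derive(Debug, PartialEq, Eq)]
pub enum FilterResult { Accept(Language), Reject(RejectReason) }

pub struct FilterOptions {
    pub additional_extensions: HashMap<String, Language>,
    pub skip_heuristics: bool,
}

// Reduced stand-ins for the helpers specified in this document.
fn is_blocklisted(ext: &str) -> bool { matches!(ext, "png" | "lock" | "map") }
fn builtin_language(ext: &str) -> Option<Language> {
    match ext {
        "rs" => Some(Language::Rust),
        "ts" => Some(Language::TypeScript),
        "py" => Some(Language::Python),
        _ => None,
    }
}
fn is_binary(c: &[u8]) -> bool { c.contains(&0) }
fn is_minified(c: &[u8]) -> bool { c.split(|&b| b == b'\n').any(|l| l.len() > 500) }
fn is_generated(c: &[u8]) -> bool { String::from_utf8_lossy(c).contains("@generated") }

/// Hypothetical variant of `should_process` that honors `FilterOptions`.
pub fn should_process_with(
    path: &Path,
    content: Option<&[u8]>,
    options: &FilterOptions,
) -> FilterResult {
    let ext = match path.extension().and_then(|e| e.to_str()) {
        Some(e) => e.to_lowercase(),
        None => return FilterResult::Reject(RejectReason::UnknownExtension),
    };
    if is_blocklisted(&ext) {
        return FilterResult::Reject(RejectReason::BlocklistedExtension);
    }
    // Built-in allowlist first, then user-supplied extensions.
    let language = match builtin_language(&ext)
        .or_else(|| options.additional_extensions.get(&ext).copied())
    {
        Some(lang) => lang,
        None => return FilterResult::Reject(RejectReason::UnknownExtension),
    };
    if !options.skip_heuristics {
        if let Some(content) = content {
            if is_binary(content) {
                return FilterResult::Reject(RejectReason::BinaryContent);
            }
            if is_minified(content) {
                return FilterResult::Reject(RejectReason::MinifiedContent);
            }
            if is_generated(content) {
                return FilterResult::Reject(RejectReason::GeneratedFile);
            }
        }
    }
    FilterResult::Accept(language)
}
```

In this sketch the blocklist still takes precedence over `additional_extensions`, so a custom mapping cannot re-enable a blocklisted extension; whether that is the intended precedence is an assumption.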

## Integration with Walker

Use as a filter in the pipeline:

```rust
walk(path)
    .filter_map(|entry| entry.ok())
    .filter(|entry| entry.file_type == FileType::File)
    .filter_map(|entry| {
        let content = read_first_kb(&entry.path).ok();
        match should_process(&entry.path, content.as_deref()) {
            FilterResult::Accept(lang) => Some((entry, lang)),
            FilterResult::Reject(_) => None,
        }
    })
```

## Examples

### Accepted Files
```
src/main.rs          → Accept(Rust)
lib/utils.ts         → Accept(TypeScript)
components/App.tsx   → Accept(Tsx)
scripts/build.py     → Accept(Python)
cmd/server.go        → Accept(Go)
```

### Rejected Files
```
assets/logo.png      → Reject(BlocklistedExtension)
dist/bundle.min.js   → Reject(BlocklistedExtension)  // .min.js blocklisted
vendor/lib.js        → Reject(MinifiedContent)       // if line > 500 chars
generated/proto.go   → Reject(GeneratedFile)         // contains "// Code generated"
README.md            → Reject(UnknownExtension)
```