# Filter Specification
## Purpose
Determine which files should be processed for codemap extraction using a layered filtering approach.
## Filtering Layers
```
┌──────────────────────────────────────────┐
│ 1. Extension blocklist (instant skip)    │
│    .png .jpg .wasm .so .exe .zip .lock   │
├──────────────────────────────────────────┤
│ 2. Extension allowlist (attempt parse)   │
│    .rs .ts .tsx .js .jsx .py .go         │
├──────────────────────────────────────────┤
│ 3. Heuristics (before parsing)           │
│    - Null bytes in first 1KB → binary    │
│    - Any line > 500 chars → minified     │
│    - Contains generated markers → skip   │
└──────────────────────────────────────────┘
```
## Types
### FilterResult
```rust
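/// Outcome of running a path (and optionally its leading bytes) through the filter.
pub enum FilterResult {
    /// The file should be parsed as the given language.
    Accept(Language),
    /// The file is skipped for the given reason.
    Reject(RejectReason),
}

/// Why a file was rejected, one variant per check.
pub enum RejectReason {
    BlocklistedExtension, // Layer 1
    UnknownExtension,     // Layer 2 (also used for files with no extension)
    BinaryContent,        // Layer 3: null byte found
    MinifiedContent,      // Layer 3: over-long line found
    GeneratedFile,        // Layer 3: generated-file marker found
}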
```
### FilterOptions
```rust
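use std::collections::HashMap;

pub struct FilterOptions {
    /// Extra extension → language mappings, keyed by extension without the leading dot.
    pub additional_extensions: HashMap<String, Language>,
    /// When `true`, the Layer 3 content heuristics are skipped.
    pub skip_heuristics: bool,
}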
```
## Functions
### should_process
```rust
pub fn should_process(path: &Path, content: Option<&[u8]>) -> FilterResult
```
**Preconditions:**
- `path` is a file path
- `content` is the first 1 KB of the file (used by the heuristics), or `None` to skip the heuristics
**Postconditions:**
- Returns `Accept(language)` if file should be processed
- Returns `Reject(reason)` otherwise
### is_blocklisted
```rust
pub fn is_blocklisted(extension: &str) -> bool
```
**Postconditions:**
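- Returns `true` if `extension` is in the blocklist; `extension` is passed without the leading dot and already lowercased, as `should_process` does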
### detect_language
```rust
pub fn detect_language(path: &Path) -> Option<Language>
```
**Postconditions:**
- Returns `Some(language)` if extension maps to supported language
- Returns `None` otherwise
### is_binary
```rust
pub fn is_binary(content: &[u8]) -> bool
```
**Postconditions:**
- Returns `true` if `content` contains at least one null byte (`0x00`)
### is_minified
```rust
pub fn is_minified(content: &[u8]) -> bool
```
**Postconditions:**
- Returns `true` if any newline-delimited line in `content` is longer than 500 bytes
### is_generated
```rust
pub fn is_generated(content: &[u8]) -> bool
```
**Postconditions:**
- Returns `true` if `content` contains any of the generated-file markers listed under Generated File Detection
## Extension Blocklist
Extensions that are always rejected:
### Binary Formats
```
.png .jpg .jpeg .gif .webp .ico .svg
.wasm .so .dll .dylib .exe .bin
.zip .tar .gz .bz2 .xz .7z .rar
.pdf .doc .docx .xls .xlsx
.mp3 .mp4 .wav .avi .mov
.ttf .otf .woff .woff2 .eot
```
### Lock Files
```
.lock
package-lock.json
yarn.lock
Cargo.lock
poetry.lock
Gemfile.lock
```
### Other
```
.min.js .min.css
.map
.d.ts (type declarations; may optionally be included)
```
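Several entries above are not plain extensions: `.min.js`, `.min.css` and `.d.ts` are compound suffixes, and `package-lock.json` is a full file name. `Path::extension()` returns only the last component (`js` for `bundle.min.js`), so an extension-only check cannot see them. The following is a minimal sketch of one way to cover both cases. It assumes a helper named `is_blocklisted_name`, which is hypothetical and not part of this spec, and it spells out only a subset of the blocklist.

```rust
/// Extension-only check. `extension` has no leading dot and is lowercase.
/// Only a subset of the blocklist is listed here for brevity.
pub fn is_blocklisted(extension: &str) -> bool {
    matches!(
        extension,
        "png" | "jpg" | "jpeg" | "gif" | "wasm" | "so" | "exe" | "zip" | "lock" | "map"
    )
}

/// Hypothetical helper (not in the spec): matches compound suffixes and
/// exact file names that an extension-only check cannot express.
pub fn is_blocklisted_name(file_name: &str) -> bool {
    const SUFFIXES: &[&str] = &[".min.js", ".min.css", ".d.ts"];
    const NAMES: &[&str] = &["package-lock.json"];

    // Compare case-insensitively, matching the extension rules.
    let name = file_name.to_lowercase();
    NAMES.iter().any(|n| name == *n) || SUFFIXES.iter().any(|s| name.ends_with(*s))
}
```

Note that `yarn.lock`, `Cargo.lock`, `poetry.lock` and `Gemfile.lock` already have the extension `lock`, so the extension check alone rejects them.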
## Extension Allowlist
Extensions that are processed, mapped to languages:
| Extension | Language |
|-----------|----------|
| `.rs` | Rust |
| `.ts` | TypeScript |
| `.tsx` | Tsx |
| `.js` | JavaScript |
| `.jsx` | Jsx |
| `.mjs` | JavaScript |
| `.cjs` | JavaScript |
| `.py` | Python |
| `.pyi` | Python |
| `.go` | Go |
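The table translates directly into a `match` on the lowercased extension. This is a sketch only: the `Language` enum is not defined in this document, so the variant names below are assumed from the table's right-hand column.

```rust
use std::path::Path;

/// Assumed shape of `Language`; variant names are taken from the table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Language {
    Rust,
    TypeScript,
    Tsx,
    JavaScript,
    Jsx,
    Python,
    Go,
}

pub fn detect_language(path: &Path) -> Option<Language> {
    // No extension, or a non-UTF-8 extension, means no language.
    let ext = path.extension()?.to_str()?.to_lowercase();
    let language = match ext.as_str() {
        "rs" => Language::Rust,
        "ts" => Language::TypeScript,
        "tsx" => Language::Tsx,
        "js" | "mjs" | "cjs" => Language::JavaScript,
        "jsx" => Language::Jsx,
        "py" | "pyi" => Language::Python,
        "go" => Language::Go,
        _ => return None,
    };
    Some(language)
}
```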
## Heuristics
### Binary Detection
Check first 1KB for null bytes (0x00):
```rust
pub fn is_binary(content: &[u8]) -> bool {
    content.contains(&0)
}
```
**Rationale:** Text files should not contain null bytes. This catches accidentally included binary files.
### Minified Detection
Check for lines longer than 500 bytes (line length is measured in bytes, not Unicode characters):
```rust
pub fn is_minified(content: &[u8]) -> bool {
    content
        .split(|&b| b == b'\n')
        .any(|line| line.len() > 500)
}
```
**Rationale:** Minified JavaScript/CSS has very long lines. These files are not useful for codemaps.
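Because `line.len()` on a byte slice counts bytes, the 500 threshold is effectively a byte limit: a line of 300 two-byte characters already exceeds it. The sketch below contrasts the check above with a character-counting variant. `is_minified_chars` is a hypothetical name introduced for illustration and is not part of this spec.

```rust
/// The byte-based check from this section.
pub fn is_minified(content: &[u8]) -> bool {
    content
        .split(|&b| b == b'\n')
        .any(|line| line.len() > 500)
}

/// Hypothetical variant that counts Unicode scalar values instead of bytes.
/// Invalid UTF-8 is replaced rather than rejected.
pub fn is_minified_chars(content: &[u8]) -> bool {
    let text = String::from_utf8_lossy(content);
    text.lines().any(|line| line.chars().count() > 500)
}
```

For mostly ASCII source code the two agree; they diverge only on long lines with many multi-byte characters, where the byte-based check is the stricter one.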
### Generated File Detection
Check for common generated file markers:
```rust
const GENERATED_MARKERS: &[&str] = &[
    "// Code generated",
    "// DO NOT EDIT",
    "# Generated by",
    "/* Auto-generated */",
    "// This file is auto-generated",
    "@generated",
    "// generated from",
];

pub fn is_generated(content: &[u8]) -> bool {
    // Decode lossily: the 1 KB window may end in the middle of a multi-byte
    // character, and a strict `from_utf8` would then fail for the whole chunk
    // and hide every marker in it.
    let text = String::from_utf8_lossy(content);
    GENERATED_MARKERS.iter().any(|marker| text.contains(*marker))
}
```
**Rationale:** Generated files (protobuf outputs, codegen) don't need codemaps.
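A 1 KB window can end in the middle of a multi-byte UTF-8 sequence. Strict decoding with `std::str::from_utf8` fails for the entire chunk in that case, whereas lossy decoding keeps the readable prefix. The sketch below illustrates the lossy approach on such a truncated chunk. `is_generated_lossy` is a hypothetical name, and only a subset of the markers is repeated.

```rust
/// Subset of the marker list, repeated so the sketch is self-contained.
const MARKERS: &[&str] = &["// Code generated", "// DO NOT EDIT", "@generated"];

/// Hypothetical variant of `is_generated` that tolerates invalid UTF-8.
pub fn is_generated_lossy(content: &[u8]) -> bool {
    // Invalid sequences become U+FFFD; valid text around them is preserved.
    let text = String::from_utf8_lossy(content);
    MARKERS.iter().any(|marker| text.contains(*marker))
}

/// Builds a chunk whose last byte is the first half of a two-byte character,
/// imitating a read that was cut off at the window boundary.
pub fn truncated_chunk() -> Vec<u8> {
    let mut chunk = b"// Code generated by protoc. ".to_vec();
    chunk.push("é".as_bytes()[0]);
    chunk
}
```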
## Full Algorithm
```rust
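pub fn should_process(path: &Path, content: Option<&[u8]>) -> FilterResult {
    // Layer 1a: compound suffixes and exact names from the blocklist.
    // `Path::extension()` only sees the last component (`js` in `bundle.min.js`),
    // so these entries are matched against the lowercased file name.
    let name = path
        .file_name()
        .and_then(|n| n.to_str())
        .map(|n| n.to_lowercase())
        .unwrap_or_default();
    if name == "package-lock.json"
        || [".min.js", ".min.css", ".d.ts"].iter().any(|s| name.ends_with(*s))
    {
        return FilterResult::Reject(RejectReason::BlocklistedExtension);
    }

    // Get the lowercased extension (without the leading dot)
    let ext = match path.extension().and_then(|e| e.to_str()) {
        Some(e) => e.to_lowercase(),
        None => return FilterResult::Reject(RejectReason::UnknownExtension),
    };

    // Layer 1b: Extension blocklist
    if is_blocklisted(&ext) {
        return FilterResult::Reject(RejectReason::BlocklistedExtension);
    }

    // Layer 2: Allowlist
    let language = match detect_language(path) {
        Some(lang) => lang,
        None => return FilterResult::Reject(RejectReason::UnknownExtension),
    };

    // Layer 3: Heuristics (if content provided); the first failing check wins
    if let Some(content) = content {
        if is_binary(content) {
            return FilterResult::Reject(RejectReason::BinaryContent);
        }
        if is_minified(content) {
            return FilterResult::Reject(RejectReason::MinifiedContent);
        }
        if is_generated(content) {
            return FilterResult::Reject(RejectReason::GeneratedFile);
        }
    }

    FilterResult::Accept(language)
}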
```
## Edge Cases
### No Extension
Files without extensions are rejected with `UnknownExtension`.
### Multiple Extensions
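Only the last extension determines the language: `file.test.ts` → `.ts` → TypeScript. The compound blocklist entries (`.min.js`, `.min.css`, `.d.ts`) are the exception: they must match the end of the full file name, as in `dist/bundle.min.js` → `Reject(BlocklistedExtension)`.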
### Case Sensitivity
Extensions are compared case-insensitively: `.RS` = `.rs`
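The extension rules above (no extension, multiple extensions, case-insensitivity) follow from how `std::path::Path::extension` behaves plus a lowercase step. A short sketch, where `normalized_extension` is a hypothetical helper name:

```rust
use std::path::Path;

/// Hypothetical helper: the last extension, lowercased, without the dot.
/// Returns `None` for names with no extension, including dotfiles such as
/// `.gitignore`, which `Path::extension` treats as having none.
pub fn normalized_extension(path: &Path) -> Option<String> {
    path.extension()
        .and_then(|e| e.to_str())
        .map(|e| e.to_lowercase())
}
```

With this helper, `file.test.ts` yields `ts`, `MAIN.RS` yields `rs`, and both `Makefile` and `.gitignore` yield `None`.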
### Empty Files
Empty files pass all heuristics (no null bytes, no long lines, no markers) and are accepted if their extension is on the allowlist.
### UTF-8 BOM
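Files starting with a UTF-8 BOM (`EF BB BF`) are not considered binary, because the BOM contains no null bytes. UTF-16 encoded text, by contrast, typically does contain null bytes and is rejected as `BinaryContent`.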
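The null-byte check treats the two encodings differently, which a short sketch can confirm. `utf8_with_bom` and `utf16le_bytes` are hypothetical helpers used only to build sample input.

```rust
/// Same check as `is_binary` in this spec: any 0x00 byte marks the chunk as binary.
pub fn is_binary(content: &[u8]) -> bool {
    content.contains(&0)
}

/// Hypothetical helper: prefix text with the UTF-8 byte-order mark.
pub fn utf8_with_bom(text: &str) -> Vec<u8> {
    let mut out: Vec<u8> = vec![0xEF, 0xBB, 0xBF];
    out.extend_from_slice(text.as_bytes());
    out
}

/// Hypothetical helper: encode text as UTF-16LE with its byte-order mark.
pub fn utf16le_bytes(text: &str) -> Vec<u8> {
    let mut out: Vec<u8> = vec![0xFF, 0xFE];
    // ASCII characters become a byte followed by 0x00 in UTF-16LE.
    out.extend(text.encode_utf16().flat_map(|unit| unit.to_le_bytes()));
    out
}
```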
### Very Large Files
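Only the first 1 KB is read for heuristics, so an over-long line or a generated-file marker that first appears later in the file goes undetected. If the heuristics pass, the full file is read for codemap extraction.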
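This document calls `read_first_kb` in the walker integration but does not define it. A minimal sketch using only the standard library might look as follows; the signature and the `HEURISTIC_WINDOW` constant are assumptions, chosen so the result works with `content.as_deref()`.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Assumed size of the heuristic window, in bytes.
const HEURISTIC_WINDOW: u64 = 1024;

/// Reads at most the first 1 KB of `path`. Shorter files are returned whole.
pub fn read_first_kb(path: &Path) -> io::Result<Vec<u8>> {
    let file = File::open(path)?;
    let mut buf = Vec::with_capacity(HEURISTIC_WINDOW as usize);
    // `take` caps the reader, so `read_to_end` stops after 1024 bytes.
    file.take(HEURISTIC_WINDOW).read_to_end(&mut buf)?;
    Ok(buf)
}
```

Capping the reader with `take` avoids loading large files into memory just to inspect their first kilobyte.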
## Configuration
### Adding Custom Extensions
```rust
let options = FilterOptions {
    additional_extensions: [
        ("vue".to_string(), Language::TypeScript),
        ("svelte".to_string(), Language::TypeScript),
    ].into(),
    skip_heuristics: false,
};
```
### Skipping Heuristics
For trusted codebases, skip heuristic checks:
```rust
let options = FilterOptions {
    additional_extensions: HashMap::new(),
    skip_heuristics: true,
};
```
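None of the function signatures in this document takes a `FilterOptions`, so how the options reach the filter is left open. One possible wiring is sketched below under several assumptions: the function name `should_process_with` is hypothetical, built-in mappings take precedence over `additional_extensions`, and the blocklist and most heuristics are omitted to keep the example short.

```rust
use std::collections::HashMap;
use std::path::Path;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Language { Rust, TypeScript, Python }

#[derive(Debug, PartialEq, Eq)]
pub enum RejectReason { UnknownExtension, BinaryContent }

#[derive(Debug, PartialEq, Eq)]
pub enum FilterResult { Accept(Language), Reject(RejectReason) }

#[derive(Default)]
pub struct FilterOptions {
    pub additional_extensions: HashMap<String, Language>,
    pub skip_heuristics: bool,
}

/// Reduced stand-in for the built-in allowlist.
fn builtin_language(ext: &str) -> Option<Language> {
    match ext {
        "rs" => Some(Language::Rust),
        "ts" => Some(Language::TypeScript),
        "py" => Some(Language::Python),
        _ => None,
    }
}

/// Hypothetical options-aware entry point (not defined by the spec).
pub fn should_process_with(
    path: &Path,
    content: Option<&[u8]>,
    options: &FilterOptions,
) -> FilterResult {
    let ext = match path.extension().and_then(|e| e.to_str()) {
        Some(e) => e.to_lowercase(),
        None => return FilterResult::Reject(RejectReason::UnknownExtension),
    };

    // Assumption: built-in mappings win, custom mappings fill the gaps.
    let language = builtin_language(&ext)
        .or_else(|| options.additional_extensions.get(&ext).copied());
    let language = match language {
        Some(lang) => lang,
        None => return FilterResult::Reject(RejectReason::UnknownExtension),
    };

    // Heuristics run only when enabled and when content was supplied.
    if !options.skip_heuristics {
        if let Some(bytes) = content {
            if bytes.contains(&0) {
                return FilterResult::Reject(RejectReason::BinaryContent);
            }
        }
    }

    FilterResult::Accept(language)
}
```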
## Integration with Walker
Use as a filter in the pipeline:
```rust
walk(path)
    .filter_map(|entry| entry.ok())
    .filter(|entry| entry.file_type == FileType::File)
    .filter_map(|entry| {
        // A failed read yields `None`, so the heuristics are skipped for that file.
        let content = read_first_kb(&entry.path).ok();
        match should_process(&entry.path, content.as_deref()) {
            FilterResult::Accept(lang) => Some((entry, lang)),
            FilterResult::Reject(_) => None,
        }
    })
```
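The pipeline above discards the reject reason. A caller that wants diagnostics, for example to report how many files each layer filtered out, could collect the reasons and tally them. The sketch below is one way to do that; `tally_rejections` is a hypothetical helper, and the derives on `RejectReason` are assumptions needed to use it as a map key.

```rust
use std::collections::HashMap;

/// `RejectReason` as listed under Types, with derives assumed for hashing.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum RejectReason {
    BlocklistedExtension,
    UnknownExtension,
    BinaryContent,
    MinifiedContent,
    GeneratedFile,
}

/// Hypothetical helper: counts how often each reason occurred.
pub fn tally_rejections<I>(reasons: I) -> HashMap<RejectReason, usize>
where
    I: IntoIterator<Item = RejectReason>,
{
    let mut counts = HashMap::new();
    for reason in reasons {
        *counts.entry(reason).or_insert(0) += 1;
    }
    counts
}
```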
## Examples
### Accepted Files
```
src/main.rs          → Accept(Rust)
lib/utils.ts         → Accept(TypeScript)
components/App.tsx   → Accept(Tsx)
scripts/build.py     → Accept(Python)
cmd/server.go        → Accept(Go)
```
### Rejected Files
```
assets/logo.png      → Reject(BlocklistedExtension)
dist/bundle.min.js   → Reject(BlocklistedExtension)  // .min.js blocklisted
vendor/lib.js        → Reject(MinifiedContent)       // if line > 500 chars
generated/proto.go   → Reject(GeneratedFile)         // contains "// Code generated"
README.md            → Reject(UnknownExtension)
```