scribe-scanner
High-performance file system scanning and indexing for Scribe repository analysis.
Overview
scribe-scanner is the foundational crate responsible for efficiently traversing repositories, filtering files, detecting languages, and building the initial file metadata that feeds into Scribe's analysis pipeline. It is designed to scale from small projects to enterprise codebases with 100k+ files.
Key Features
Fast Repository Traversal
- Parallel scanning using `rayon` for multi-core file system traversal
- Smart filtering respects `.gitignore`, `.scribeignore`, and custom patterns
- Early pruning of excluded directories to minimize filesystem operations
- Metadata caching with xxhash-based content signatures for incremental updates
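
To make the early-pruning idea concrete, here is a minimal sketch over an in-memory tree. The `Node` type and `walk` function are hypothetical and are not part of the crate's API; the point is only that an excluded directory is rejected before any of its children are visited.

```rust
/// Hypothetical in-memory stand-in for a directory tree.
pub enum Node {
    File(String),
    Dir(String, Vec<Node>),
}

/// Collects file paths under `node`, skipping any directory whose name
/// appears in `excluded`. Returns how many directories were entered.
pub fn walk(node: &Node, prefix: &str, excluded: &[&str], out: &mut Vec<String>) -> usize {
    match node {
        Node::File(name) => {
            out.push(format!("{prefix}{name}"));
            0
        }
        Node::Dir(name, children) => {
            if excluded.contains(&name.as_str()) {
                // Pruned: none of the children are ever visited.
                return 0;
            }
            let path = format!("{prefix}{name}/");
            let mut entered = 1;
            for child in children {
                entered += walk(child, &path, excluded, out);
            }
            entered
        }
    }
}
```

Checking the exclusion on the directory itself, rather than on each file inside it, is what keeps large ignored trees such as build output from costing any filesystem calls.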
Language Detection
- Extension-based detection for 20+ programming languages
- Heuristic analysis for ambiguous files (e.g., detecting shell scripts without extensions)
- Language metrics including LOC, comment density, and complexity estimates
- Tree-sitter integration for AST-based language validation
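
As a rough illustration of how extension lookup and a shebang fallback can be combined, consider the sketch below. The function name `detect_language` and the small language table are assumptions made for this example; the crate's real detector covers more languages and may work differently.

```rust
/// Hypothetical detector: extension lookup first, then a shebang fallback.
pub fn detect_language(file_name: &str, first_line: &str) -> Option<&'static str> {
    // 1. Extension-based detection.
    if let Some((_, ext)) = file_name.rsplit_once('.') {
        let lang = match ext.to_ascii_lowercase().as_str() {
            "rs" => Some("Rust"),
            "py" => Some("Python"),
            "ts" => Some("TypeScript"),
            "go" => Some("Go"),
            "sh" => Some("Shell"),
            _ => None,
        };
        if lang.is_some() {
            return lang;
        }
    }

    // 2. Shebang heuristic for files without a recognised extension.
    let shebang = first_line.strip_prefix("#!")?;
    // Reduce "/usr/bin/env python3" to "python3" and "/bin/bash" to "bash".
    let interpreter = shebang
        .split_whitespace()
        .map(|part| part.rsplit('/').next().unwrap_or(part))
        .find(|part| *part != "env")?;

    if interpreter.starts_with("python") {
        Some("Python")
    } else if matches!(interpreter, "sh" | "bash" | "zsh") {
        Some("Shell")
    } else {
        None
    }
}
```

For example, `detect_language("deploy", "#!/usr/bin/env python3")` falls through the extension step and resolves via the shebang.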
File Classification
- ContentType detection: Code, documentation, configuration, data, test files
- Template detection: Identifies boilerplate and auto-generated files
- Test file recognition: Links test files to their corresponding source files
- Documentation scoring: Prioritizes README, design docs, and API documentation
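
A path-based classifier along these lines might look like the following sketch. The `ContentType` variants mirror the list above, but the `classify` function, its rules, and the extension groupings are illustrative assumptions, not the crate's actual logic.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum ContentType {
    Code,
    Documentation,
    Configuration,
    Data,
    Test,
}

/// Hypothetical classifier that looks only at the path, not the contents.
pub fn classify(path: &str) -> ContentType {
    let lower = path.to_ascii_lowercase();
    let file_name = lower.rsplit('/').next().unwrap_or(lower.as_str());
    let (stem, ext) = file_name.rsplit_once('.').unwrap_or((file_name, ""));

    // Test conventions are checked first so `tests/scan.rs` is not just Code.
    let in_test_dir = lower.starts_with("tests/") || lower.contains("/tests/");
    if in_test_dir || stem.starts_with("test_") || stem.ends_with("_test") {
        return ContentType::Test;
    }
    if stem == "readme" {
        return ContentType::Documentation;
    }
    match ext {
        "md" | "rst" | "txt" => ContentType::Documentation,
        "toml" | "yaml" | "yml" | "ini" => ContentType::Configuration,
        "json" | "csv" | "xml" => ContentType::Data,
        // Everything else is treated as code in this sketch.
        _ => ContentType::Code,
    }
}
```

The order of the checks matters: a more specific signal (test location or naming) has to win over the generic extension mapping.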
Git Integration
- Commit history analysis for file churn metrics (change frequency and recency)
- Blame data extraction for understanding file evolution
- Branch awareness for selective scanning
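
One plausible way to turn commit timestamps into the frequency and recency figures mentioned above is sketched here. The `Churn` struct, its field names, and the 30-day normalisation are assumptions for illustration; the crate may define churn differently.

```rust
/// Hypothetical churn summary for a single file.
#[derive(Debug)]
pub struct Churn {
    pub commit_count: usize,
    /// Change frequency, normalised to commits per 30 days.
    pub commits_per_30_days: f64,
    /// Recency: whole days since the most recent commit.
    pub days_since_last_change: u64,
}

/// Computes churn from Unix timestamps (seconds) of commits touching a file.
/// Returns `None` when the file has no recorded commits.
pub fn churn(commit_times: &[u64], now: u64) -> Option<Churn> {
    const DAY: u64 = 86_400;
    let first = *commit_times.iter().min()?;
    let last = *commit_times.iter().max()?;
    // Clamp to at least one day so a brand-new file does not divide by zero.
    let span_days = (now.saturating_sub(first) / DAY).max(1);
    Some(Churn {
        commit_count: commit_times.len(),
        commits_per_30_days: commit_times.len() as f64 * 30.0 / span_days as f64,
        days_since_last_change: now.saturating_sub(last) / DAY,
    })
}
```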
Architecture
    Repository → Scanner → FileMetadata → Analysis Pipeline
        ↓           ↓            ↓
    .gitignore    Filter      Language
     Patterns     Engine      Detection
        ↓                        ↓
      Ignore                 AST Parser
      Rules                 (tree-sitter)
Core Components
RepositoryScanner
Main entry point for repository traversal. Configurable scanning options:
- Token budgets and file size limits
- Include/exclude patterns (glob and regex)
- Test file inclusion/exclusion
- Parallel scanning thread count
FileMetadata
Rich metadata structure containing:
- Path, size, modification time
- Language detection results
- Content type classification
- Git churn metrics
- Token estimates
- Importance scoring hints
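
The fields listed above could be laid out roughly as follows. This is a sketch only: the field names, their types, and the four-bytes-per-token estimate are assumptions, and the crate's real `FileMetadata` may differ.

```rust
use std::path::PathBuf;
use std::time::SystemTime;

/// Illustrative layout; names and types are assumed, not taken from the crate.
#[derive(Debug, Clone)]
pub struct FileMetadata {
    pub path: PathBuf,
    pub size: u64,
    pub modified: Option<SystemTime>,
    pub language: Option<String>,
    pub content_type: String,
    pub churn_commits: u32,
    pub token_estimate: usize,
    pub importance_hint: f32,
}

/// Rough heuristic (an assumption): about four bytes per token, rounded up.
pub fn estimate_tokens(size_bytes: u64) -> usize {
    ((size_bytes + 3) / 4) as usize
}

impl FileMetadata {
    /// Builds a record with only path and size known; later pipeline stages
    /// would fill in language, content type, and churn.
    pub fn new(path: impl Into<PathBuf>, size: u64) -> Self {
        Self {
            path: path.into(),
            size,
            modified: None,
            language: None,
            content_type: "unknown".to_string(),
            churn_commits: 0,
            token_estimate: estimate_tokens(size),
            importance_hint: 0.0,
        }
    }
}
```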
IgnoreEngine
Handles pattern matching for file exclusion:
- `.gitignore` parsing using `ignore` crate
- Custom `.scribeignore` patterns
- Binary file detection and exclusion
- Size-based filtering
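
Binary detection and size filtering can be approximated with very little code. The sketch below uses a common heuristic (a NUL byte in the first few kilobytes means binary); the function names and the 8 KiB sample window are assumptions, and the crate may use a different test.

```rust
/// Heuristic: treat content as binary if a NUL byte appears in the first 8 KiB.
pub fn looks_binary(sample: &[u8]) -> bool {
    sample.iter().take(8192).any(|&byte| byte == 0)
}

/// Hypothetical combined filter: skip files that are too large or look binary.
/// The size check runs first because it needs no file contents.
pub fn should_skip(size: u64, max_size: u64, sample: &[u8]) -> bool {
    size > max_size || looks_binary(sample)
}
```

Because only a leading sample is inspected, the check stays cheap even for very large files.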
LanguageDetector
Determines file language and characteristics:
- Extension mapping to languages
- Shebang detection for scripts
- Tree-sitter parser availability checking
- Language-specific metrics
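
As an example of a simple language-specific metric, the sketch below counts code, comment, and blank lines given a language's line-comment prefix. `LineMetrics` and `line_metrics` are hypothetical names, and block comments and trailing comments are deliberately ignored to keep the example short.

```rust
#[derive(Debug, PartialEq)]
pub struct LineMetrics {
    pub code: usize,
    pub comment: usize,
    pub blank: usize,
}

impl LineMetrics {
    /// Share of non-blank lines that are comments (0.0 for an empty file).
    pub fn comment_density(&self) -> f64 {
        let non_blank = self.code + self.comment;
        if non_blank == 0 {
            0.0
        } else {
            self.comment as f64 / non_blank as f64
        }
    }
}

/// Classifies each line by its trimmed prefix. A line with code followed by a
/// trailing comment counts as code.
pub fn line_metrics(source: &str, line_comment: &str) -> LineMetrics {
    let mut metrics = LineMetrics { code: 0, comment: 0, blank: 0 };
    for line in source.lines() {
        let trimmed = line.trim();
        if trimmed.is_empty() {
            metrics.blank += 1;
        } else if trimmed.starts_with(line_comment) {
            metrics.comment += 1;
        } else {
            metrics.code += 1;
        }
    }
    metrics
}
```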
Usage
Basic Scanning
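    use scribe_scanner::{RepositoryScanner, ScanConfig};

    // Field values are illustrative; see "ScanConfig Options" for the full list.
    let config = ScanConfig {
        root_path: ".".into(),
        ..Default::default()
    };

    let scanner = RepositoryScanner::new(config);
    let files = scanner.scan().await?;

    println!("Scanned {} files", files.len());
    for file in files {
        println!("{}", file.path.display());
    }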
Custom Patterns
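    use scribe_scanner::{PatternSet, RepositoryScanner, ScanConfig};

    let mut config = ScanConfig::default();
    // Illustrative glob patterns; the exact PatternSet constructor arguments may differ.
    config.exclude_patterns = PatternSet::new(&["target/**", "**/*.min.js"]);
    config.include_patterns = PatternSet::new(&["src/**/*.rs"]);

    let scanner = RepositoryScanner::new(config);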
Git Churn Analysis
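    use scribe_scanner::ChurnAnalyzer; // module path assumed

    let analyzer = ChurnAnalyzer::new(".")?;
    let churn_data = analyzer.analyze_file("src/lib.rs")?;

    // Field names below are illustrative.
    println!("Commits: {}", churn_data.commit_count);
    println!("Change frequency: {}", churn_data.change_frequency);
    println!("Last changed: {}", churn_data.last_modified);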
Performance
Targets
- Small repos (≤1k files): <500ms scan time, <50MB memory
- Medium repos (1k-10k): <3s scan time, <200MB memory
- Large repos (10k-100k): <15s scan time, <1GB memory
- Enterprise (100k+): <30s scan time, <2GB memory
Optimizations
- Parallel traversal: Multi-threaded directory walking
- Early filtering: Prune excluded directories before scanning
- Lazy AST parsing: Only parse files when needed for analysis
- Incremental caching: Content-based signatures avoid re-scanning unchanged files
- String interning: Reduce memory overhead for repeated path components
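
The incremental caching idea can be sketched as a map from path to content signature. Note the swap: this example uses the standard library's `DefaultHasher` as a stand-in for xxhash so that it has no dependencies, and `ScanCache` and `needs_rescan` are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::Hasher;

/// Content signature. `DefaultHasher` stands in for xxhash here; its output
/// is not guaranteed stable across Rust releases, so a persisted cache would
/// need a hash with a fixed specification instead.
pub fn signature(content: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    hasher.write(content);
    hasher.finish()
}

/// Hypothetical in-memory cache of the last signature seen for each path.
#[derive(Default)]
pub struct ScanCache {
    signatures: HashMap<String, u64>,
}

impl ScanCache {
    /// Records the current signature and returns `true` when the file is new
    /// or its content changed since the previous call.
    pub fn needs_rescan(&mut self, path: &str, content: &[u8]) -> bool {
        let sig = signature(content);
        self.signatures.insert(path.to_string(), sig) != Some(sig)
    }
}
```

Keying on content rather than modification time means a file that was touched but not changed is still treated as unchanged.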
Configuration
ScanConfig Options
| Field | Type | Default | Description |
|---|---|---|---|
| `root_path` | `PathBuf` | `"."` | Repository root directory |
| `max_file_size` | `usize` | 1MB | Skip files larger than this |
| `exclude_tests` | `bool` | `false` | Exclude test files from scan |
| `follow_symlinks` | `bool` | `false` | Follow symbolic links |
| `include_patterns` | `PatternSet` | Empty | Glob patterns for inclusion |
| `exclude_patterns` | `PatternSet` | Empty | Glob patterns for exclusion |
| `max_depth` | `Option<usize>` | None | Maximum directory depth |
| `parallel_threads` | `usize` | CPU count | Scanner thread pool size |
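
Read as a Rust type, the table corresponds to something like the sketch below. It is a stand-alone approximation: `PatternSet` is replaced by a plain `Vec<String>` alias, and 1MB is taken to mean 1,048,576 bytes, both of which are assumptions.

```rust
use std::path::PathBuf;

/// Stand-in for the real `PatternSet` type (assumption for this sketch).
pub type PatternSet = Vec<String>;

#[derive(Debug, Clone)]
pub struct ScanConfig {
    pub root_path: PathBuf,
    pub max_file_size: usize,
    pub exclude_tests: bool,
    pub follow_symlinks: bool,
    pub include_patterns: PatternSet,
    pub exclude_patterns: PatternSet,
    pub max_depth: Option<usize>,
    pub parallel_threads: usize,
}

impl Default for ScanConfig {
    fn default() -> Self {
        Self {
            root_path: PathBuf::from("."),
            max_file_size: 1024 * 1024, // "1MB" interpreted as 1 MiB
            exclude_tests: false,
            follow_symlinks: false,
            include_patterns: PatternSet::new(),
            exclude_patterns: PatternSet::new(),
            max_depth: None,
            // CPU count, falling back to 1 if it cannot be determined.
            parallel_threads: std::thread::available_parallelism()
                .map(|n| n.get())
                .unwrap_or(1),
        }
    }
}
```

With a `Default` implementation in place, callers can override a single field using struct update syntax, for example `ScanConfig { exclude_tests: true, ..ScanConfig::default() }`.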
Integration
scribe-scanner is designed as a foundational crate used by higher-level components:
- scribe-analysis: Consumes `FileMetadata` for AST parsing and import extraction
- scribe-graph: Uses scan results to build dependency graphs
- scribe-selection: Applies scoring to scanned files for selection decisions
- scribe-scaling: Optimizes scan performance for large repositories
See Also
- `scribe-patterns`: Advanced pattern matching and glob support
- `scribe-analysis`: AST parsing and semantic analysis
- `scribe-core`: Shared types and configuration
- `../../ARCHITECTURE.md`: Overall system design