Unified scanner module for repository scanning
This module provides a unified scanner implementation used by both the CLI and language bindings. It includes:
- ScannerConfig: Configuration for scanning behavior
- FileInfo: Intermediate file metadata during scanning
- UnifiedScanner: Main scanner with configurable features
- Binary detection utilities
§Architecture
The scanner uses a pipelined architecture for large repositories:
- Walk phase: Collect file paths with the `ignore` crate
- Read phase: Multiple reader threads read file contents
- Parse phase: Parser threads extract symbols in parallel
- Aggregate phase: Collect results into final Repository
For smaller repositories (< 100 files), a simpler parallel approach is used.
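The four phases can be sketched with standard-library channels, one thread per stage. This is a simplified illustration, not the module's implementation: `ParsedFile`, `count_symbols`, and `run_pipeline` are hypothetical names, the file system is replaced by an in-memory map, and each stage runs on a single thread, whereas the module describes multiple reader and parser threads.

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for a parsed file; not the crate's real type.
#[derive(Debug, PartialEq)]
pub struct ParsedFile {
    pub path: String,
    pub symbol_count: usize,
}

/// Toy "parser": counts lines that start with `fn `.
fn count_symbols(content: &str) -> usize {
    content
        .lines()
        .filter(|line| line.trim_start().starts_with("fn "))
        .count()
}

/// Runs walk -> read -> parse -> aggregate as overlapping stages.
/// `sources` maps path to content and stands in for the file system.
pub fn run_pipeline(sources: HashMap<String, String>) -> Vec<ParsedFile> {
    let (path_tx, path_rx) = mpsc::channel::<String>();
    let (content_tx, content_rx) = mpsc::channel::<(String, String)>();
    let (result_tx, result_rx) = mpsc::channel::<ParsedFile>();

    // Walk phase: emit every known path, then drop the sender to close the channel.
    let paths: Vec<String> = sources.keys().cloned().collect();
    let walker = thread::spawn(move || {
        for path in paths {
            if path_tx.send(path).is_err() {
                break;
            }
        }
    });

    // Read phase: look up the content for each path as it arrives.
    let reader = thread::spawn(move || {
        for path in path_rx {
            if let Some(content) = sources.get(&path) {
                if content_tx.send((path, content.clone())).is_err() {
                    break;
                }
            }
        }
    });

    // Parse phase: extract "symbols" from each piece of content.
    let parser = thread::spawn(move || {
        for (path, content) in content_rx {
            let symbol_count = count_symbols(&content);
            if result_tx.send(ParsedFile { path, symbol_count }).is_err() {
                break;
            }
        }
    });

    // Aggregate phase: drain results until the parser closes its channel.
    let mut results: Vec<ParsedFile> = result_rx.into_iter().collect();
    results.sort_by(|a, b| a.path.cmp(&b.path));

    walker.join().expect("walker thread panicked");
    reader.join().expect("reader thread panicked");
    parser.join().expect("parser thread panicked");
    results
}
```

Because each stage starts work as soon as its input channel yields an item, reading one file can overlap with parsing another, which is the point of pipelining on large repositories.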
§Features
Configurable features include:
- Memory-mapped I/O for large files (>= 1MB)
- Accurate tiktoken tokenization vs fast estimation
- Pipelined vs simple parallel processing
- Batch processing to prevent stack overflow
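A minimal sketch of how two of these switches might combine with the documented thresholds. `SketchConfig`, its fields, and both helper functions are hypothetical; the real `ScannerConfig` fields are not shown on this page. The 1 MB limit is assumed to mean 1024 * 1024 bytes, and the 100-file cutoff is taken from the architecture notes.

```rust
/// Hypothetical feature switches; the real `ScannerConfig` may differ.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct SketchConfig {
    pub use_mmap: bool,
    pub use_pipelining: bool,
}

#[derive(Debug, PartialEq)]
pub enum ReadStrategy {
    Mmap,
    Buffered,
}

#[derive(Debug, PartialEq)]
pub enum ScanMode {
    Pipelined,
    SimpleParallel,
}

/// Assumed to be 1 MiB; the page only says "1MB".
const SKETCH_MMAP_MIN_BYTES: u64 = 1024 * 1024;
/// Repositories with fewer files than this use the simple parallel path.
const SKETCH_PIPELINE_MIN_FILES: usize = 100;

/// Picks memory-mapped I/O only when it is enabled and the file is large enough.
pub fn read_strategy(cfg: &SketchConfig, file_size: u64) -> ReadStrategy {
    if cfg.use_mmap && file_size >= SKETCH_MMAP_MIN_BYTES {
        ReadStrategy::Mmap
    } else {
        ReadStrategy::Buffered
    }
}

/// Picks pipelined mode only when it is enabled and the repository is large enough.
pub fn scan_mode(cfg: &SketchConfig, file_count: usize) -> ScanMode {
    if cfg.use_pipelining && file_count >= SKETCH_PIPELINE_MIN_FILES {
        ScanMode::Pipelined
    } else {
        ScanMode::SimpleParallel
    }
}
```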
Structs§
- FileInfo: Intermediate struct for collecting file info before parallel processing
- ScannerConfig: Runtime configuration for repository scanning
- UnifiedScanner: Unified scanner for repository scanning
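To illustrate the idea of an intermediate record gathered before any content is read, here is a sketch that walks a directory with the standard library and records only path and size. `SketchFileInfo` and `collect_infos` are hypothetical; the fields of the real `FileInfo` are not shown on this page, and unlike the module (which walks with the `ignore` crate) this version does not honor ignore rules.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for `FileInfo`: metadata only, no content.
#[derive(Debug, Clone, PartialEq)]
pub struct SketchFileInfo {
    pub path: PathBuf,
    pub size: u64,
}

/// Collects path and size for every regular file under `root`
/// without opening or reading any of the files.
pub fn collect_infos(root: &Path) -> io::Result<Vec<SketchFileInfo>> {
    let mut infos = Vec::new();
    // Explicit work list instead of recursion, so deep trees do not grow the call stack.
    let mut pending = vec![root.to_path_buf()];
    while let Some(dir) = pending.pop() {
        for entry in fs::read_dir(&dir)? {
            let entry = entry?;
            let meta = entry.metadata()?;
            if meta.is_dir() {
                pending.push(entry.path());
            } else if meta.is_file() {
                infos.push(SketchFileInfo {
                    path: entry.path(),
                    size: meta.len(),
                });
            }
        }
    }
    // Sort for a deterministic order regardless of directory iteration order.
    infos.sort_by(|a, b| a.path.cmp(&b.path));
    Ok(infos)
}
```

With sizes known up front, later stages can decide per file how to read it, or estimate token and line counts, before paying for any I/O on the content itself.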
Constants§
- BINARY_EXTENSIONS: List of known binary file extensions
- DEFAULT_BATCH_SIZE: Maximum files to process in a single parallel batch to avoid stack overflow
- MMAP_THRESHOLD: Threshold for using memory-mapped I/O (files >= 1MB use mmap)
- PIPELINE_THRESHOLD: Minimum number of files to trigger pipelined mode
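The role of a batch-size limit can be sketched as follows: work is split into fixed-size chunks, and each chunk finishes before the next one starts, which bounds how much parallel work is in flight at once. The value of `SKETCH_BATCH_SIZE` is made up for illustration (the real `DEFAULT_BATCH_SIZE` value is not shown on this page), and `process_in_batches` is a hypothetical helper whose per-file "work" is just measuring the path length.

```rust
use std::thread;

/// Illustrative value only; not the crate's DEFAULT_BATCH_SIZE.
pub const SKETCH_BATCH_SIZE: usize = 4;

/// Processes `paths` in batches of at most `batch_size`, running the
/// items of each batch in parallel and preserving input order.
pub fn process_in_batches(paths: &[String], batch_size: usize) -> Vec<usize> {
    assert!(batch_size > 0, "batch_size must be non-zero");
    let mut out = Vec::with_capacity(paths.len());
    for batch in paths.chunks(batch_size) {
        // Scoped threads may borrow `batch`; every thread is joined
        // before the scope returns, so batches never overlap.
        let batch_results: Vec<usize> = thread::scope(|s| {
            let handles: Vec<_> = batch
                .iter()
                .map(|path| s.spawn(move || path.len()))
                .collect();
            handles
                .into_iter()
                .map(|h| h.join().expect("worker thread panicked"))
                .collect()
        });
        out.extend(batch_results);
    }
    out
}
```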
Functions§
- collect_file_infos: Collect file information (paths, sizes) without reading content
- collect_file_paths: Collect file paths from a repository, returning only path information
- count_tokens: Count tokens using configurable method
- count_tokens_accurate: Count tokens using thread-local tokenizer (accurate via tiktoken)
- estimate_lines: Estimate lines from file size
- estimate_tokens: Estimate tokens from file size
- is_binary_content: Check if content appears to be binary by examining bytes
- is_binary_extension: Check if a file path has a known binary extension
- parse_with_thread_local: Parse content using thread-local parser (lock-free)
- process_file_content_only: Process a file with content reading only (no parsing - fast path)
- process_file_with_content: Process a file with content reading and parsing (used in parallel)
- process_file_without_content: Process a file without reading content (fast path)
- scan_files_pipelined: Pipelined file scanning with overlapped I/O and parsing
- scan_repository: Scan a repository and return a Repository struct
- smart_read_file: Smart file reading that uses mmap for large files
- smart_read_file_with_options: Smart file reading with configurable mmap support
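The two binary checks listed above can be approximated as shown below. This is a sketch under stated assumptions, not the module's code: the extension list is a small illustrative subset rather than the real `BINARY_EXTENSIONS`, the content check uses a common heuristic (a NUL byte in the first 8 KiB) that may differ from what `is_binary_content` examines, and both function names are hypothetical.

```rust
use std::path::Path;

/// Illustrative subset; the crate's real list is not reproduced here.
const SKETCH_BINARY_EXTENSIONS: &[&str] = &["png", "jpg", "zip", "exe", "pdf"];

/// How many leading bytes the content heuristic inspects (assumed value).
const SKETCH_SNIFF_BYTES: usize = 8192;

/// Cheap check: does the path end in a known binary extension?
/// The comparison ignores ASCII case, so `PNG` matches `png`.
pub fn has_binary_extension(path: &Path) -> bool {
    path.extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| {
            SKETCH_BINARY_EXTENSIONS
                .iter()
                .any(|known| known.eq_ignore_ascii_case(ext))
        })
        .unwrap_or(false)
}

/// Content check: treat data as binary if a NUL byte appears
/// within the first `SKETCH_SNIFF_BYTES` bytes.
pub fn looks_binary(content: &[u8]) -> bool {
    let window = &content[..content.len().min(SKETCH_SNIFF_BYTES)];
    window.contains(&0u8)
}
```

Checking the extension first is the cheaper test because it needs no I/O; the content check covers files whose extension is missing or misleading.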