Module scanner


Unified scanner module for repository scanning

This module provides a unified scanner implementation used by both the CLI and language bindings. It includes:

§Architecture

The scanner uses a pipelined architecture for large repositories:

For smaller repositories (< 100 files), a simpler parallel approach is used.
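The choice between the two strategies can be sketched as a simple threshold check. This is an illustrative sketch, not the module's actual code: the `ScanMode` enum and `choose_scan_mode` function are hypothetical names, and the value 100 for `PIPELINE_THRESHOLD` is assumed from the "< 100 files" remark above.

```rust
/// Assumed value, inferred from the "< 100 files" remark; the real constant
/// may differ.
const PIPELINE_THRESHOLD: usize = 100;

/// Hypothetical enum naming the two scanning strategies described above.
#[derive(Debug, PartialEq, Eq)]
enum ScanMode {
    /// Simpler parallel processing for small repositories.
    SimpleParallel,
    /// Pipelined processing for large repositories.
    Pipelined,
}

/// Pick a strategy from the number of files discovered (hypothetical helper).
fn choose_scan_mode(file_count: usize) -> ScanMode {
    if file_count >= PIPELINE_THRESHOLD {
        ScanMode::Pipelined
    } else {
        ScanMode::SimpleParallel
    }
}
```

Under these assumptions, a repository with 99 files would take the simple parallel path and one with 100 or more would be pipelined.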

Configurable features include:

Structs

FileInfo: Intermediate struct for collecting file info before parallel processing
ScannerConfig: Runtime configuration for repository scanning
UnifiedScanner: Unified scanner for repository scanning

Constants

BINARY_EXTENSIONS: List of known binary file extensions
DEFAULT_BATCH_SIZE: Maximum files to process in a single parallel batch to avoid stack overflow
MMAP_THRESHOLD: Threshold for using memory-mapped I/O (files >= 1MB use mmap)
PIPELINE_THRESHOLD: Minimum number of files to trigger pipelined mode
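How the size and batch constants might be applied can be sketched as follows. The values are assumptions: `MMAP_THRESHOLD` is taken as 1 MiB from the "files >= 1MB" description (it could equally be 1,000,000 bytes), and the value of `DEFAULT_BATCH_SIZE` is not stated on this page, so the number below is a placeholder. The helpers `should_use_mmap` and `split_into_batches` are hypothetical.

```rust
/// Assumed to be 1 MiB, based on the "files >= 1MB use mmap" description.
const MMAP_THRESHOLD: u64 = 1024 * 1024;

/// Placeholder value; the real constant is not given in this documentation.
const DEFAULT_BATCH_SIZE: usize = 5_000;

/// Hypothetical helper: files at or above the threshold would be memory-mapped.
fn should_use_mmap(file_size: u64) -> bool {
    file_size >= MMAP_THRESHOLD
}

/// Hypothetical helper: split work into bounded batches so that no single
/// parallel call handles more than `batch_size` items.
fn split_into_batches<T>(items: &[T], batch_size: usize) -> Vec<&[T]> {
    // `chunks` panics on a size of 0, so clamp to at least 1.
    items.chunks(batch_size.max(1)).collect()
}
```

Bounding the batch size keeps the amount of work handed to one parallel call fixed regardless of repository size, which matches the stated purpose of `DEFAULT_BATCH_SIZE`.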

Functions

collect_file_infos: Collect file information (paths, sizes) without reading content
collect_file_paths: Collect file paths from a repository, returning only path information
count_tokens: Count tokens using a configurable method
count_tokens_accurate: Count tokens using a thread-local tokenizer (accurate via tiktoken)
estimate_lines: Estimate lines from file size
estimate_tokens: Estimate tokens from file size
is_binary_content: Check if content appears to be binary by examining bytes
is_binary_extension: Check if a file path has a known binary extension
parse_with_thread_local: Parse content using a thread-local parser (lock-free)
process_file_content_only: Process a file with content reading only (no parsing - fast path)
process_file_with_content: Process a file with content reading and parsing (used in parallel)
process_file_without_content: Process a file without reading content (fast path)
scan_files_pipelined: Pipelined file scanning with overlapped I/O and parsing
scan_repository: Scan a repository and return a Repository struct
smart_read_file: Smart file reading that uses mmap for large files
smart_read_file_with_options: Smart file reading with configurable mmap support
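The binary checks and size-based estimates listed above can be illustrated with a minimal sketch. The signatures, the extension subset, the 8 KiB inspection window, the NUL-byte heuristic, and the 4-bytes-per-token ratio are all assumptions made for illustration; the module's actual implementations may differ.

```rust
use std::path::Path;

/// Small illustrative subset; the real BINARY_EXTENSIONS list is not shown
/// in this documentation.
const BINARY_EXTENSIONS: &[&str] = &["png", "jpg", "gif", "pdf", "zip", "exe", "so", "dll"];

/// Assumed signature: true if the path's extension is a known binary type.
/// The comparison is case-insensitive.
fn is_binary_extension(path: &Path) -> bool {
    path.extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| {
            BINARY_EXTENSIONS
                .iter()
                .any(|known| known.eq_ignore_ascii_case(ext))
        })
        .unwrap_or(false)
}

/// Assumed heuristic: treat content as binary if a NUL byte appears in the
/// first 8 KiB. Real detection may use different rules.
fn is_binary_content(bytes: &[u8]) -> bool {
    let window = &bytes[..bytes.len().min(8192)];
    window.contains(&0)
}

/// Assumed ratio of roughly 4 bytes per token; a rough estimate only.
fn estimate_tokens(size_bytes: u64) -> u64 {
    size_bytes / 4
}
```

Checking the extension first avoids opening the file at all, while the content check catches binary files with missing or misleading extensions.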