Module scanner

Unified scanner module for repository scanning

This module provides a unified scanner implementation used by both the CLI and language bindings. It includes:

  • ScannerConfig: Configuration for scanning behavior
  • FileInfo: Intermediate file metadata during scanning
  • UnifiedScanner: Main scanner with configurable features
  • Binary detection utilities

§Architecture

The scanner uses a pipelined architecture for large repositories:

  1. Walk phase: Collect file paths with the ignore crate
  2. Read phase: Multiple reader threads read file contents
  3. Parse phase: Parser threads extract symbols in parallel
  4. Aggregate phase: Collect results into final Repository

Repositories with fewer than 100 files skip the pipeline and use a simpler parallel approach.
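
The four phases can be sketched with standard-library channels. This is a minimal illustration under stated assumptions, not the module's implementation: it runs one thread per stage (the module describes multiple reader and parser threads), uses an in-memory map in place of the filesystem and the ignore crate, and substitutes a hypothetical `parse_sketch` that counts non-empty lines for real symbol extraction. Every name below is invented for the example.

```rust
use std::collections::HashMap;
use std::sync::{mpsc, Arc};
use std::thread;

/// Hypothetical stand-in for the parse phase: counts non-empty lines.
fn parse_sketch(content: &str) -> usize {
    content.lines().filter(|l| !l.trim().is_empty()).count()
}

/// Walk -> read -> parse -> aggregate over an in-memory "filesystem"
/// (path -> content). Stages overlap because each runs on its own thread
/// and hands work to the next stage through a channel.
fn scan_pipelined_sketch(fs: HashMap<String, String>) -> HashMap<String, usize> {
    let (path_tx, path_rx) = mpsc::channel::<String>();
    let (content_tx, content_rx) = mpsc::channel::<(String, String)>();
    let (result_tx, result_rx) = mpsc::channel::<(String, usize)>();

    let fs = Arc::new(fs);
    let walk_fs = Arc::clone(&fs);
    let read_fs = fs;

    // Walk phase: emit paths only.
    let walker = thread::spawn(move || {
        for path in walk_fs.keys() {
            if path_tx.send(path.clone()).is_err() {
                break;
            }
        }
        // path_tx is dropped here, which closes the channel for the reader.
    });

    // Read phase: turn each path into (path, content).
    let reader = thread::spawn(move || {
        for path in path_rx {
            if let Some(content) = read_fs.get(&path) {
                if content_tx.send((path, content.clone())).is_err() {
                    break;
                }
            }
        }
    });

    // Parse phase: turn content into a per-file result.
    let parser = thread::spawn(move || {
        for (path, content) in content_rx {
            let count = parse_sketch(&content);
            if result_tx.send((path, count)).is_err() {
                break;
            }
        }
    });

    // Aggregate phase: drain results until the parser closes its channel.
    let results: HashMap<String, usize> = result_rx.into_iter().collect();

    walker.join().unwrap();
    reader.join().unwrap();
    parser.join().unwrap();
    results
}
```

Each stage ends when the sender feeding it is dropped, so the pipeline shuts down in order without explicit signalling.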

§Features

Configurable features include:

  • Memory-mapped I/O for large files (>= 1MB)
  • Accurate tiktoken tokenization vs fast estimation
  • Pipelined vs simple parallel processing
  • Batch processing to prevent stack overflow
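
Two of these features are easy to illustrate in isolation. The sketch below shows fast token estimation from file size and batched parallel processing. It rests on assumptions not confirmed by this page: the ratio of roughly four bytes per token is a common heuristic rather than the module's documented formula, and the function names are hypothetical. Batches run one after another, with the items inside a batch processed in parallel on scoped threads.

```rust
use std::thread;

/// Assumed heuristic: about 4 bytes per token, rounded up.
fn estimate_tokens_sketch(size_bytes: u64) -> u64 {
    (size_bytes + 3) / 4
}

/// Applies `f` to every item, at most `batch_size` items at a time.
/// Bounding the batch bounds the number of tasks in flight at once.
fn process_in_batches_sketch<T: Sync, R: Send>(
    items: &[T],
    batch_size: usize,
    f: impl Fn(&T) -> R + Sync,
) -> Vec<R> {
    let f = &f; // share one reference across all worker threads
    let mut out = Vec::with_capacity(items.len());
    // A batch size of 0 would panic in `chunks`, so clamp it to 1.
    for batch in items.chunks(batch_size.max(1)) {
        let results = thread::scope(|s| {
            // One scoped thread per item in the current batch.
            let handles: Vec<_> = batch
                .iter()
                .map(|item| s.spawn(move || f(item)))
                .collect();
            // Joining in spawn order preserves input order.
            handles
                .into_iter()
                .map(|h| h.join().unwrap())
                .collect::<Vec<R>>()
        });
        out.extend(results);
    }
    out
}
```

A thread per item keeps the example dependency-free; a production scanner would more likely hand each batch to a thread pool.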

Structs§

FileInfo
Intermediate struct for collecting file info before parallel processing
ScannerConfig
Runtime configuration for repository scanning
UnifiedScanner
Unified scanner for repository scanning

Constants§

BINARY_EXTENSIONS
List of known binary file extensions
DEFAULT_BATCH_SIZE
Maximum files to process in a single parallel batch to avoid stack overflow
MMAP_THRESHOLD
Threshold for using memory-mapped I/O (files >= 1MB use mmap)
PIPELINE_THRESHOLD
Minimum number of files to trigger pipelined mode
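
As a rough illustration of how the two thresholds could drive decisions, the sketch below picks a read strategy from the file size and a scan mode from the file count. The constant values are assumptions inferred from the descriptions on this page (1 MB taken as 1024 * 1024 bytes, and a pipeline threshold of 100 files); the enum and function names are hypothetical and do not belong to this module.

```rust
// Assumed values; the real constants may differ.
const MMAP_THRESHOLD_SKETCH: u64 = 1024 * 1024;
const PIPELINE_THRESHOLD_SKETCH: usize = 100;

#[derive(Debug, PartialEq)]
enum ReadStrategy {
    Buffered,
    Mmap,
}

#[derive(Debug, PartialEq)]
enum ScanMode {
    SimpleParallel,
    Pipelined,
}

/// Files at or above the threshold are candidates for memory-mapped I/O.
fn read_strategy_sketch(size_bytes: u64) -> ReadStrategy {
    if size_bytes >= MMAP_THRESHOLD_SKETCH {
        ReadStrategy::Mmap
    } else {
        ReadStrategy::Buffered
    }
}

/// Repositories at or above the threshold use the pipelined scanner.
fn scan_mode_sketch(file_count: usize) -> ScanMode {
    if file_count >= PIPELINE_THRESHOLD_SKETCH {
        ScanMode::Pipelined
    } else {
        ScanMode::SimpleParallel
    }
}
```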

Functions§

collect_file_infos
Collect file information (paths, sizes) without reading content
collect_file_paths
Collect file paths from a repository, returning only path information
count_tokens
Count tokens using the configured method (accurate or estimated)
count_tokens_accurate
Count tokens using thread-local tokenizer (accurate via tiktoken)
estimate_lines
Estimate lines from file size
estimate_tokens
Estimate tokens from file size
is_binary_content
Check if content appears to be binary by examining bytes
is_binary_extension
Check if a file path has a known binary extension
parse_with_thread_local
Parse content using thread-local parser (lock-free)
process_file_content_only
Process a file with content reading only (no parsing - fast path)
process_file_with_content
Process a file with content reading and parsing (used in parallel)
process_file_without_content
Process a file without reading content (fast path)
scan_files_pipelined
Pipelined file scanning with overlapped I/O and parsing
scan_repository
Scan a repository and return a Repository struct
smart_read_file
Smart file reading that uses mmap for large files
smart_read_file_with_options
Smart file reading with configurable mmap support
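
The two binary checks listed above, is_binary_extension and is_binary_content, can be approximated as follows. This is a sketch of a common approach, not the module's code: the extension list is a small illustrative subset, the content check assumes that a NUL byte in the first 8 KiB marks a file as binary, and both function names are hypothetical.

```rust
use std::path::Path;

// Illustrative subset; the real BINARY_EXTENSIONS list is not shown here.
const BINARY_EXTENSIONS_SKETCH: &[&str] = &["png", "jpg", "gif", "zip", "exe", "so", "pdf"];

/// Case-insensitive match of the path's extension against the list.
/// Paths without an extension are treated as non-binary.
fn looks_binary_by_extension(path: &Path) -> bool {
    path.extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| {
            BINARY_EXTENSIONS_SKETCH
                .iter()
                .any(|known| known.eq_ignore_ascii_case(ext))
        })
        .unwrap_or(false)
}

/// Assumed heuristic: a NUL byte within the first 8 KiB means binary.
fn looks_binary_by_content(bytes: &[u8]) -> bool {
    let sample = &bytes[..bytes.len().min(8192)];
    sample.contains(&0)
}
```

Checking the extension first avoids reading the file at all; the content check catches binary files with missing or misleading extensions.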